The Impact of Device Type and Sizing
on Phase Noise Mechanisms 


Albert Jerng, Student Member, IEEE, and Charles G. Sodini, Fellow, IEEE 


Abstract—Phase noise mechanisms in integrated LC voltage- 
controlled oscillators (VCOs) using MOS transistors are investi- 
gated. The degradation in phase noise due to low-frequency bias 
noise is shown to be a function of AM-PM conversion in the MOS 
switching transistors. By exploiting this dependence, bias noise 
contributions to phase noise are minimized through MOS device 
sizing rather than through filtering. NMOS and PMOS VCO 
designs are compared in terms of thermal noise. Short-channel 
MOS considerations explain why 0.18-µm PMOS devices can
attain better phase noise than 0.18-µm NMOS devices in the $1/f^2$
region. Phase noise in the $1/f^3$ region is primarily dependent
upon the upconversion of flicker noise from the MOS switching
transistors rather than from the bias circuit, and can be improved
by decreasing MOS switching device size. Measured results on
an experimental set of VCOs confirm the dependencies predicted
by analysis. A 5.3-GHz all-PMOS VCO topology demonstrates
measured phase noise of -124 dBc/Hz at 1-MHz offset and
-100 dBc/Hz at 100-kHz offset while dissipating 13.5 mW from a
1.8-V supply using a 0.18-µm SiGe BiCMOS process.


Index Terms—Flicker noise, phase noise, voltage-controlled os- 
cillator (VCO), WiGLAN. 


I. INTRODUCTION 


HIGH data-rate wireless LAN applications are driving the
continued development of highly integrated system-on-chip
(SOC) solutions. Integrated voltage-controlled oscillators
(VCOs) are essential components of such wireless systems. In 
this work, the VCOs are designed in the context of the MIT 
Wireless Gigabit Local Area Network (WiGLAN) project. The 
aim of the WiGLAN is to achieve a maximum 1-Gb/s data rate 
using 150 MHz of bandwidth in frequency bands allocated in the 
5-6-GHz range. An adaptive M-ary modulation scheme, up to 
256 QAM, is chosen to achieve higher data rates, imposing strin- 
gent accuracy requirements on the local oscillator (LO) signal. 
Thus, VCOs with low phase noise and high operating frequen- 
cies are required. 

The nonlinear and time-varying nature of an oscillator complicates
phase noise analysis [1]. The existence of many sources of noise
coming from several different frequencies makes it difficult to
discern which noise mechanisms are the dominant ones. Recent
progress in VCO research has revealed that bias noise is an
important contributor to phase noise [2]. High-frequency bias
current noise has been observed to be a dominant contributor to
phase noise [3], [4]. AM noise originating from the upconversion
of low-frequency bias noise cannot be neglected due to AM-PM
conversion through the varactor [5]-[7]. Solutions to reduce bias
noise have included filtering [4], [5] and removing the bias
current generator [6]. Filtering requires extra inductors and
capacitors. Removing the current source is also problematic, due
to increased sensitivity to power supply noise and variation of
the bias current over process and temperature.

Manuscript received December 4, 2003; revised August 15, 2004. This work
was supported by the MIT Center for Integrated Circuits and Systems.

The authors are with the Massachusetts Institute of Technology, Cambridge,
MA 02139 USA (e-mail: ajerng@mit.edu).

In Section III, we show that bias noise should not be treated as 
a fixed source of phase noise. Its phase noise contribution varies 
as a function of the switching transistor device parameters. We 
identify AM-PM conversion in the MOS switching transistors 
as a fundamental mechanism for the upconversion of low-fre- 
quency bias noise into phase noise and discuss how to minimize 
the upconversion factor without the use of filtering. 

It has been shown that thermal noise due to the switching
transistors in CMOS implementations is independent of MOS device
size and depends on bias current and $\gamma$, the channel noise
coefficient [4]. A review of reported VCOs reveals a wide range
in the choice of switching device size and type (NMOS, PMOS,
or both). In several cases, the choice of using PMOS transistors
is made strictly for their lower $1/f$ noise in that particular
technology [8], [9]. We have not seen in the literature any reasons
given for choosing an NMOS device or a PMOS device with regard
to $1/f^2$ phase noise, largely because the devices have been
assumed to yield equivalent thermal noise for equivalent $g_m$. In
Section IV, we show why this assumption does not always hold
in short-channel devices. As a result, lower $1/f^2$ phase noise
can be achieved using PMOS devices instead of NMOS devices
for a given $g_m$.

Previous work has stated that the tail current source is the
primary source of $1/f$ noise and that the contribution due to the
MOS cross-coupled pair is made small by the switching action
of the oscillator [10]. Ismail et al. [11] presented a design that
removed the current source and also utilized a suppression
technique for the switching transistor $1/f$ noise. In Section V, we
will show that reduction in $1/f^3$ phase noise is fundamentally
limited by switching transistor $1/f$ noise, and not tail current
source $1/f$ noise. We outline a mechanism for $1/f$ noise
upconversion and show how to reduce this upconversion factor.

Because phase noise measurements do not provide insight 
into the relative contributions of circuit noise sources, it is diffi- 
cult to confirm theories specific to particular noise sources or 
mechanisms with a single VCO design. In conjunction with 
simulations, we designed an experimental set of seven VCOs 
that enabled us to isolate particular noise mechanisms from one 
another.

Fig. 1. NMOS and PMOS VCO topologies.


II. VCO EXPERIMENT 


The VCO topology used in the experiment, shown in Fig. 1, 
consists of a cross-coupled pair of NMOS or PMOS devices, a 
tail current source with associated current mirror circuitry, and 
a differential LC tank that uses standard p+/n- junction diodes
as varactors. This topology allows low-voltage operation and
provides a convenient reference for the varactor to either power
or ground, maximizing the voltage tuning range. The VCO circuits
were fabricated on a 0.18-µm SiGe BiCMOS process with
inductor quality factors (Q) of approximately 10 at 5 GHz.

The phase noise of a VCO is fundamentally related to several 
key parameters. A semi-empirical model formulated by Leeson 
expresses this phase noise behavior as [12] 


$$\mathcal{L}(f_m) = \frac{2kTR_{eq}F}{A_o^2}\left(\frac{f_o}{2Qf_m}\right)^2\left(1 + \frac{\Delta f_{1/f^3}}{f_m}\right) \tag{1}$$


According to this equation, the parameters that determine
phase noise at a frequency offset $f_m$ are the voltage swing,
$A_o$, the tank impedance at resonance, $R_{eq}$, the tank quality
factor, $Q$, the excess noise factor, $F$, the $1/f$ corner frequency
of the circuit noise, $\Delta f_{1/f^3}$, and the oscillation frequency,
$f_o$. $A_o$, $R_{eq}$, $Q$, and $f_o$ were held constant in our experiment
by designing each VCO with the same bias current and tank
circuit, allowing meaningful comparison between the noise
sources and mechanisms particular to each design. The bias
current was set at 7.5 mA to maximize the voltage swing, $A_o$,
under the constraint of gate oxide reliability. In this technology,
$A_o$ was limited to 1.8-V peak differential. Key circuit parameters
were then varied between the seven VCOs. MOS, NPN, and
resistor current sources were implemented, NMOS and PMOS
switching transistors were compared, and switching device
parameters such as $g_m$ and $f_t$ were varied through device
sizing. Finally, the tuning range was kept constant for all
designs to maintain a controlled noise term due to AM-PM
conversion in the varactor.
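
As a rough numerical companion to (1), the sketch below evaluates Leeson's expression in Python. The values of R_eq, F, and the corner frequency are placeholders chosen only for illustration and are not reported in this work; the 1.8-V swing, a Q of about 10, and a carrier near 5.3 GHz loosely echo the text.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K


def leeson_phase_noise_dbc(f_m, f_o, q, a_o, r_eq, excess_f, f_corner, temp_k=300.0):
    """Single-sideband phase noise (dBc/Hz) from Leeson's model, eq. (1)."""
    thermal = 2.0 * BOLTZMANN * temp_k * r_eq * excess_f / a_o**2
    shaping = (f_o / (2.0 * q * f_m)) ** 2   # tank filtering, 1/f_m^2 slope
    flicker = 1.0 + f_corner / f_m           # extra 1/f_m slope below the corner
    return 10.0 * math.log10(thermal * shaping * flicker)


# Hypothetical operating point: r_eq, excess_f and f_corner are assumptions.
params = dict(f_o=5.3e9, q=10.0, a_o=1.8, r_eq=300.0, excess_f=3.0, f_corner=200e3)

for f_m in (10e3, 100e3, 1e6, 10e6):
    print(f"{f_m:>10.0f} Hz: {leeson_phase_noise_dbc(f_m, **params):7.1f} dBc/Hz")
```

Well above the corner the curve falls at 20 dB/decade (the $1/f^2$ region); well below it the slope approaches 30 dB/decade (the $1/f^3$ region).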

A bias current mirror was used to generate the tail current, as
opposed to applying an external voltage bias on the tail current
transistor. Because we are particularly interested in evaluating
the relevance of bias noise, it was essential not to leave out any
noise sources which may impact its magnitude. In the basic
current mirror configuration with mirror ratio N > 1, the output
bias current noise is actually dominated by the degeneration
resistor in the emitter/source leg of the current mirror device.


III. BIAS NOISE 


A general model shown in Fig. 2 illustrates the conversion of
bias noise into phase noise as a two-step process. First, bias
current noise $\overline{i_n^2}$ at frequency $\omega_n$ is translated in
frequency by the switching action of the MOS cross-coupled pair.
Low-frequency bias noise ($\omega_n < \omega_o$) mixes up to create two
correlated sidebands at $(\omega_o + \omega_n)$ and $(\omega_o - \omega_n)$, resulting
in only amplitude modulation (AM) noise. High-frequency bias noise
at $\omega_n = (2\omega_o + \Delta\omega)$ downconverts into a single noise
sideband in the passband of the LC tank, containing both AM and
PM noise. The resulting output noise current is then amplified
and shaped by the positive feedback loop and LC tank filter. This
fundamental process of an oscillator limits the output signal and
suppresses amplitude noise, implying that only the PM noise
components arising from high-frequency bias noise contribute
to phase noise. However, the AM noise can potentially be
converted into PM noise due to the presence of nonlinear
components in the feedback loop.

High-frequency bias noise is attenuated by the low bandwidth 
of the bias transistors and by the decoupling capacitor at the 
input to the current mirror. In our design, bias noise near $2\omega_o$,
or 10 GHz, is more than an order of magnitude below the level of 
the low-frequency bias noise. Simulations run with and without 
a high-frequency bias noise filter yield identical phase noise re- 
sults. We conclude that high-frequency bias noise is not a sig- 
nificant phase noise contributor to our design. 

Amplitude variations due to low-frequency bias noise can be
converted into phase noise through modulation of the varactor
capacitance. However, this conversion is not fundamental to the
design of a VCO. Proper choice of the VCO topology can mitigate
varactor AM-PM conversion. Minimizing varactor sensitivity
(MHz/V) directly reduces varactor AM-PM conversion at the cost
of a reduced tuning range. By adding a bank of digitally
switchable capacitors in parallel with the varactor, the overall
tuning range can be regained without any increase in varactor
sensitivity [2]. We designed our VCOs for a tuning range of
400 MHz, corresponding to an average varactor sensitivity of
approximately 200 MHz/V. A simulation replacing the varactor
with an ideal capacitor showed little change in phase noise,
indicating negligible impact of varactor AM-PM conversion on
phase noise.

Fig. 2. Bias noise conversion into phase noise.

Amplitude variations can also modulate the phase delay as- 
sociated with the MOS switching devices. The signal swing of 
an oscillator causes the operating point of the switching transis- 
tors to vary periodically. The MOS cross-coupled pair exhibits 
an AM-PM transfer function that is dependent on its operating 
characteristics as well as its source and load impedances. 

A criterion for oscillation is that the magnitude and phase of
the loop gain are unity and 0°, respectively. Phase delays within
the loop force an opposite phase shift in the LC tank to maintain
0° phase. As a result, the oscillation frequency $\omega_o$ is shifted [13].

TABLE I
DEPENDENCE OF $f_o$ ON $I_{bias}$: 60 µm/0.18 µm NMOS VCO

I_bias (mA)   f_o (MHz)   HD2 (dB)   HD3 (dB)
1.5           5624        -31        -44
2             5625        -33        -43
4             5518        -24        -36
6             5468        -20        -35
8             5444        -19        -35

Table I demonstrates how $f_o$ of a 60 µm/0.18 µm NMOS
VCO deviates from its expected ac simulation value of
5690 MHz under large signal conditions. Transient analysis
shows that as the amplitude of the oscillation increases through
$I_{bias}$, $f_o$ shifts downward. The simulations indicate that the
frequency shift is related to the level of the harmonic distortion
(HD2, HD3) present at the VCO tank output nodes. This effect
has been documented and described in [2] as a form of indirect
FM. Second and third harmonics of the fundamental current
component are generated by the switching transistors. When
driven into the LC tank, these harmonics flow into the lower
impedance of the capacitance and create an imbalance in
reactive power. The oscillation frequency adjusts to compensate
for the effective phase shift. Amplitude variations modulate the
level of the harmonics, resulting in modulation of the phase
shift. Variability in the phase shift results in variability in
$\omega_o$, or phase noise. The phase noise due to this AM-PM
mechanism can be expressed as

where $\partial\omega_o/\partial I_B$, the sensitivity of $\omega_o$ to bias current
fluctuations, and $\overline{i_n^2}(\Delta\omega)$, the magnitude of the bias current
noise, can be extracted from ac and transient simulations,
respectively. Equation (3) was used to calculate phase noise
contributions due to this bias noise mechanism. The calculations
matched SpectreRF phase noise simulations.
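
To make the role of the sensitivity term concrete, the sketch below estimates $\partial f_o/\partial I_{bias}$ by finite differences from the 2 to 8 mA rows of Table I, then applies the textbook narrowband-FM relation $\mathcal{L} = K^2 S_i / (2\,\Delta f^2)$. That relation is an assumption about the general form of such an estimate, not a transcription of (3), and the bias noise density used here is a hypothetical placeholder.

```python
import math

# Table I rows with bias currents of 2, 4, 6 and 8 mA.
I_BIAS_MA = [2.0, 4.0, 6.0, 8.0]
F_O_MHZ = [5625.0, 5518.0, 5468.0, 5444.0]


def sensitivity_hz_per_amp(i_ma, f_mhz):
    """Finite-difference d(f_o)/d(I_bias) between adjacent table rows, in Hz/A."""
    return [
        (f_mhz[k + 1] - f_mhz[k]) * 1e6 / ((i_ma[k + 1] - i_ma[k]) * 1e-3)
        for k in range(len(i_ma) - 1)
    ]


def fm_phase_noise_dbc(k_hz_per_amp, noise_a2_per_hz, offset_hz):
    """Narrowband-FM estimate L = K^2 * S_i / (2 * offset^2), in dBc/Hz."""
    return 10.0 * math.log10(k_hz_per_amp**2 * noise_a2_per_hz / (2.0 * offset_hz**2))


slopes = sensitivity_hz_per_amp(I_BIAS_MA, F_O_MHZ)
HYPOTHETICAL_NOISE = 1e-21  # A^2/Hz, assumed bias current noise density

for (lo, hi), k in zip(zip(I_BIAS_MA, I_BIAS_MA[1:]), slopes):
    pn = fm_phase_noise_dbc(k, HYPOTHETICAL_NOISE, 1e6)
    print(f"{lo:.0f}-{hi:.0f} mA: {k / 1e9:6.1f} MHz/mA -> {pn:6.1f} dBc/Hz at 1 MHz")
```

With sensitivity in Hz/A and offset in Hz the factors of 2π cancel, so the result matches the angular-frequency form.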

In order to gain insight into how to minimize the AM-PM
conversion factor, we focus on how device sizing can impact VCO
harmonics. Increasing the linear range of the MOS differential
pair, which is proportional to the gate overdrive $V_{gs} - V_t$, should
reduce distortion for a given signal swing. However, at high
frequencies, the harmonic distortion is also influenced by device
capacitances. Minimizing device capacitance lowers high-frequency
distortion. Thus, two convenient metrics for the linearity
of the switching transistors are $f_t$ and $V_{gs} - V_t$. These quantities
can simultaneously be increased by decreasing the device width
of the switching devices since, for a fixed bias current,

In short-channel devices, as $W$ becomes small, the dependence
of $f_t$ on $W$ becomes weaker and $f_t$ approaches a constant
value. The $V_{gs} - V_t$ dependence approaches $1/W$ for small $W$.
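
For reference, the long-channel (square-law) trend that the short-channel behavior departs from can be sketched as follows. This is a minimal illustration under textbook assumptions ($I_D = \tfrac{1}{2}\mu C_{ox}(W/L)(V_{gs}-V_t)^2$ and $C_{gs} = \tfrac{2}{3}WLC_{ox}$); the mobility, oxide capacitance, and per-device current are hypothetical values, not process data from this work.

```python
import math


def long_channel_overdrive(i_d, w, l, mu, c_ox):
    """Square-law gate overdrive V_gs - V_t (V) of a saturated device."""
    return math.sqrt(2.0 * i_d * l / (mu * c_ox * w))


def long_channel_ft(i_d, w, l, mu, c_ox):
    """Square-law f_t = g_m / (2*pi*C_gs), with C_gs = (2/3)*W*L*C_ox."""
    v_ov = long_channel_overdrive(i_d, w, l, mu, c_ox)
    g_m = mu * c_ox * (w / l) * v_ov
    c_gs = (2.0 / 3.0) * w * l * c_ox
    return g_m / (2.0 * math.pi * c_gs)


# Hypothetical device constants (SI units) and a fixed drain current.
MU, C_OX, I_D, LENGTH = 0.03, 8e-3, 3.75e-3, 0.18e-6

for width_um in (60.0, 30.0, 15.0):
    w = width_um * 1e-6
    v_ov = long_channel_overdrive(I_D, w, LENGTH, MU, C_OX)
    f_t = long_channel_ft(I_D, w, LENGTH, MU, C_OX)
    print(f"W = {width_um:4.0f} um: V_ov = {v_ov:.3f} V, f_t = {f_t / 1e9:.1f} GHz")
```

In this idealized model both quantities scale as $1/\sqrt{W}$ at fixed current, so quartering the width doubles each of them; velocity saturation weakens the $f_t$ gain and steepens the overdrive dependence, as described above.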

Phase noise simulations using SpectreRF were run to confirm
these relationships. We first concentrate on phase noise in the
$1/f^2$ region, where flicker noise contributions are small. Figs. 3
and 4 plot simulated NMOS VCO phase noise contributions
from bias thermal noise and switching transistor thermal noise
at a 1-MHz offset, as well as the total phase noise, as a function
of the gate overdrive $V_{gs} - V_t$. Gate overdrive is increased by
either reducing device width or increasing device length while
keeping a fixed bias current. Reducing device width from 200 to
10 µm reduces the bias noise contribution by 20 dB, making it
negligible compared to the noise of the switching transistors and
resulting in improvement of the overall phase noise. Increasing
channel length from 0.18 to 0.6 µm increases gate overdrive but
reduces $f_t$, and results in a very slight increase in the bias noise
contribution. The results indicate that both the gate overdrive
and the $f_t$ of the differential pair are important for linearity.


TABLE II
EXPERIMENTAL RESULTS: MEASURED PHASE NOISE IN $1/f^2$ REGION

Switching Device   Bias       Freq. (MHz)   dBc/Hz (1 MHz)
NMOS 60/0.18       NPN        5390          -114
NMOS 20/0.18       NPN        5530          -118
NMOS 60/0.6        NPN        5350          -113
NMOS 60/0.18       Resistor   5445          -118
PMOS 200/0.18      PMOS       5020          -117
PMOS 60/0.18       PMOS       5309          -122
PMOS 30/0.18       PMOS       5320          -124

Fig. 5. NMOS VCO phase noise versus $V_{tune}$.

Notice that bias noise is the primary contributor to $1/f^2$ phase
noise in both figures unless minimum length devices with
relatively large values of $(V_{gs} - V_t)$ are used.

Table II lists measured phase noise results from the
experimental set of VCOs. An NMOS VCO's device width was
varied from 60 to 20 µm while a PMOS VCO's device width
was varied from 200 to 30 µm. No external capacitors or
filtering of any type were applied to any nodes on the bias
circuits. Improved phase noise in the $1/f^2$ region (1-MHz
offset) is observed for smaller device widths in both NMOS
and PMOS designs. In order to prove that this improvement can
be attributed to reduced bias noise upconversion, we fabricated
an identical 60-µm-width NMOS VCO with its bias circuit
replaced with a resistor sized to yield the same bias current. It
also showed improved $1/f^2$ phase noise due, in this case, to
the replacement of the current mirror bias noise with the much
smaller noise contribution of a single resistor. An NMOS VCO
with its length scaled from 0.18 to 0.6 µm had slightly degraded
$1/f^2$ phase noise as expected. Finally, Fig. 5 plots measured
phase noise for the 20-µm-width NMOS VCO at varactor
tuning voltages ranging from 0 to 1.4 V. The three curves show
little difference, even though the varactor tuning sensitivity
varies by a factor of 3 over this range. The impact of varactor
AM-PM conversion on phase noise is minor.

The degree of width reduction required for adequate bias 
noise suppression is a function of how noisy the bias circuit 
is. From a practical standpoint, reduction of device width is 
limited by two constraints. First, the loop gain of the VCO
must be sufficiently greater than 1 to guarantee oscillation over
all operating conditions, leading to a minimum required value
for $g_m$. Second, the increase in $V_{gs} - V_t$ is limited by the
voltage headroom required by the current source, and also by
the supply voltage being used.


IV. MOS CHANNEL THERMAL NOISE 


After reducing bias noise contributions, MOS channel 
thermal noise in the NMOS or PMOS switching transistors 
is the main source of phase noise in the $1/f^2$ region. An
expression for the MOS drain current noise spectral density, 
containing both thermal and flicker noise components, is 


$$\overline{i_d^2} = \overline{i_{d,th}^2} + \overline{i_{d,fl}^2} = \frac{4kT\mu_{eff}}{L^2}|Q_N| + \frac{K_f\, g_m^2}{W L C_{ox}^2 f} \tag{6}$$

where the net inversion layer charge $Q_N$, the effective mobility
$\mu_{eff}$, and the channel length $L$ are calculated to include short
channel effects such as velocity saturation.

In order to compare the two device types in terms of thermal 
noise, the NMOS and PMOS cross-coupled pairs are designed 
to have the same negative resistance. This requires sizing the 
switching transistors to have equal $g_m$. The PMOS switches are
three times wider than the NMOS switches. The two test VCOs
use the same tank and are biased at the same current, equalizing
signal power. As long as bias noise contributions to phase noise
are kept minimal through appropriate device sizing, phase noise
differences in the $1/f^2$ region can be attributed to differences
in the drain current thermal noise between NMOS and PMOS
switching transistors.

Using first-order long-channel MOS theory, we can express 
the drain current thermal noise as 


$$\overline{i_{d,th}^2} = 4kT\left(\frac{2}{3}g_m\right). \tag{7}$$

According to (7), the NMOS and PMOS VCOs should exhibit
the same phase noise performance in the $1/f^2$ region.

Fig. 6 plots measured phase noise for NMOS and PMOS
VCOs with device dimensions of 20 µm/0.18 µm and
60 µm/0.18 µm, respectively. The PMOS VCO has ~4 dB
better phase noise than the NMOS VCO in the $1/f^2$ region,
indicating lower drain current thermal noise in the PMOS device.
In order to understand this, the effects of velocity saturation in
short-channel devices must be considered. The carrier velocity
is a function of the horizontal electric field in the channel and
can be modeled by the following piecewise equation [14], [15]:


$$v = \begin{cases} \dfrac{\mu_{eff}E}{1 + E/E_C}, & E < E_C \\ v_{sat}, & E > E_C \end{cases} \tag{8}$$

where $E_C$ is the critical field at which the carriers are velocity
saturated and is equal to $2v_{sat}/\mu_{eff}$.
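
The piecewise model in (8) is easy to check numerically. The sketch below implements it and prints the product $LE_C$ for a 0.18-µm channel; the mobility and saturation-velocity values are rough, hypothetical numbers used only to illustrate why a lower critical field brings a device into velocity saturation at a smaller gate overdrive.

```python
def critical_field(mu_eff, v_sat):
    """Critical field E_C = 2 * v_sat / mu_eff (V/m)."""
    return 2.0 * v_sat / mu_eff


def carrier_velocity(e_field, mu_eff, v_sat):
    """Piecewise velocity-field model of eq. (8), continuous at E = E_C."""
    e_c = critical_field(mu_eff, v_sat)
    if e_field < e_c:
        return mu_eff * e_field / (1.0 + e_field / e_c)
    return v_sat


# Hypothetical (mu_eff in m^2/V/s, v_sat in m/s) pairs, not process data.
CARRIERS = {"electron": (0.03, 8.0e4), "hole": (0.008, 6.0e4)}
CHANNEL_LENGTH = 0.18e-6  # m

for name, (mu_eff, v_sat) in CARRIERS.items():
    e_c = critical_field(mu_eff, v_sat)
    print(f"{name}: E_C = {e_c:.2e} V/m, L*E_C = {CHANNEL_LENGTH * e_c:.2f} V")
```

At low fields the model reduces to $v \approx \mu_{eff}E$, and at $E = E_C$ the first branch evaluates to exactly $v_{sat}$, so the two pieces join without a jump.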

Fig. 6. Phase noise: NMOS (20 µm/0.18 µm) versus PMOS (60 µm/0.18 µm).

NMOS devices suffer from velocity saturation more than
PMOS devices because their critical electric field, $E_C$, is much
lower [16]. Velocity saturation is important when the channel
length, $L$, is small, and the gate overdrive voltage, $V_{gs} - V_t$,
is high. When $V_{gs} - V_t$ approaches the product $LE_C$, the
equations for drain current and transconductance based on
long-channel theory are no longer valid. In velocity saturation,
the transconductance asymptotically approaches

$$g_m = W C_{ox} v_{sat}. \tag{9}$$


Transconductance is no longer linearly proportional to
$V_{gs} - V_t$ at higher values of gate overdrive. On the other hand, the
drain current thermal noise is proportional to $Q_N$, the total
inversion layer charge stored in the channel. In the general case,
$\overline{i_{d,th}^2}$ from (6) can be expressed as [16]

where

$\Gamma$ is a function of the product $LE_C$ and bias conditions $V_{ds}$
and $V_{gs} - V_t$. According to (10) and (11), $\overline{i_{d,th}^2}$ remains
proportional to $V_{gs} - V_t$. The amount of drain current thermal noise
for a given $g_m$ can thus be related to the ratio $g_{d0}/g_m$. This ratio
is equal to one for low $V_{gs} - V_t$ but increases as the device
enters velocity saturation. Fig. 7 plots measured $g_m$ and $g_{d0}$ curves
versus $V_{gs}$ for the NMOS and PMOS device sizes used in the
test VCOs. The $g_{d0}/g_m$ ratio is clearly greater in the NMOS
device for most gate bias voltages.

Simulations using BSIM3 v3 models confirm the degrading
effects of velocity saturation. Fig. 8 plots $\overline{i_{d,th}^2}$ of 0.18-µm
NMOS and PMOS transistors versus $g_m$ for a fixed bias
current. $g_m$ is varied by changing the device width. On this plot,
lower values of $g_m$ correspond to using smaller width devices
biased at higher gate overdrive. The NMOS devices show
higher output current noise for the same $g_m$, with the difference
becoming greater as $g_m$ becomes smaller, or as $V_{gs} - V_t$
becomes larger. The devices in our two test VCOs operate at a
$g_m$ of approximately 10 mS, where the ratio between NMOS
and PMOS drain current thermal noise is about 2.3.
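
As a small worked example, the sketch below evaluates the long-channel baseline (7) at $g_m$ = 10 mS and converts the quoted NMOS/PMOS noise power ratio of 2.3 to decibels. A temperature of 300 K is assumed; the conversion describes the device-level current noise only and is not meant to reproduce the simulated phase noise contributions, which also depend on how the oscillator processes the noise.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K


def long_channel_thermal_noise(g_m, temp_k=300.0):
    """Eq. (7): drain current thermal noise density 4kT(2/3)g_m, in A^2/Hz."""
    return 4.0 * BOLTZMANN * temp_k * (2.0 / 3.0) * g_m


def ratio_to_db(power_ratio):
    """Convert a noise power ratio to decibels."""
    return 10.0 * math.log10(power_ratio)


baseline = long_channel_thermal_noise(10e-3)  # g_m = 10 mS
print(f"long-channel baseline: {baseline:.3e} A^2/Hz")
print(f"NMOS/PMOS ratio of 2.3: {ratio_to_db(2.3):.1f} dB")
```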
Phase noise simulations indicate that the drain current
thermal noise contribution of the switching transistors is about
4 dB less in the PMOS VCO than in the NMOS VCO, agreeing
well with measured results. In order to evaluate this difference
more fairly, we must consider an additional phase noise
dependency. In this particular VCO topology, the Q of the NMOS
tank is slightly lower than the Q of the PMOS tank. As shown
in Fig. 1, the parasitic diode on the varactor cathode directly
loads the NMOS tank. In the PMOS VCO, the parasitic diode is
on a virtual ground and has no effect on Q. Since phase noise is
proportional to $1/Q^2$, a degradation of 2 dB in the NMOS VCO
is expected according to tank Q simulations. This suggests that
the actual difference due to drain current thermal noise alone
is about 2 dB. In order to confirm this, a simulation where the
varactor was replaced with an ideal capacitor was run for both
VCOs. The results showed a 2.3-dB difference between NMOS
and PMOS drain current thermal noise contributions to phase
noise.


The improved phase noise performance derived from using 
PMOS switching transistors is a byproduct of the need to sup- 
press bias noise contributions. High current densities are re- 
quired in the switches to reduce the bias noise below the level 
of the switching device thermal noise. Under these bias condi- 
tions, VCOs using PMOS switches achieve significantly lower 
phase noise when using deep-submicron CMOS. 


V. MOS FLICKER NOISE 


$1/f^3$ region phase noise is caused by the upconversion of
flicker noise from both the bias circuit and the switching
transistors. Applying the analysis from Section III, we can
reduce upconversion of bias circuit $1/f$ noise by increasing
the $f_t$ and gate overdrive voltage of the switching transistors.
Switching transistor $1/f$ noise can be modeled with an
equivalent noise current source with noise spectral density
$\overline{i_{d,fl}^2}$ given by (6). Due to the $1/f$ spectral profile, this
equivalent noise source is only significant at low frequencies.
Referring to Fig. 9, we note that at low frequencies, switch
flicker noise $\overline{i_{d,fl}^2}$ sees a low impedance at its drain terminal.
Approximating this impedance as a short, we can redraw
$\overline{i_{d,fl}^2}$ with this terminal at an ac ground. $\overline{i_{d,fl}^2}$ is now in
parallel with the equivalent bias noise current source. Hence,
switching transistor flicker noise is upconverted via the same
AM-PM mechanism that upconverts low-frequency bias noise.
Reducing the device width of the switching transistors should
also reduce their upconverted $1/f$ noise. However, there are two
differences between the $1/f$ bias noise and the $1/f$ switching
transistor noise. First, changing the size of the switching
transistors affects the magnitude of $\overline{i_{d,fl}^2}$ but does not affect
the magnitude of the bias noise. Second, unlike the bias noise,
which is stationary, $\overline{i_{d,fl}^2}$ depends on the operating point of
the switching transistors and varies periodically as a function
of time.

Fig. 10 plots simulated phase noise contributions at a 10-kHz
offset as a function of $V_{gs} - V_t$ while varying the device width
of a PMOS VCO. While the $1/f$ bias noise contribution drops
rapidly as $V_{gs} - V_t$ increases, the $1/f$ switching transistor noise
contribution decreases more gradually, indicating additional
dependencies. Optimization of the total phase noise at a 10-kHz
offset requires the use of small device widths, in which case the
flicker noise contribution from the switching transistors
dominates over that from the bias. In the case of the bias flicker
noise, one can simultaneously reduce both the magnitude and
the upconversion factor of its noise, since the former involves
sizing of the bias transistors while the latter involves sizing of
the switching transistors. The current source transistor used in
our PMOS design had device dimensions of 2000 µm/1 µm.
On the other hand, sizing of the switching transistors involves
a tradeoff between the $1/f$ noise magnitude and the $1/f$ noise
upconversion factor.

Experimental results confirm that in both NMOS and PMOS
VCOs, reducing device width improves phase noise in the $1/f^3$
region (Table III). The best phase noise of -70 dBc/Hz at a
10-kHz offset is achieved by the 30 µm/0.18 µm PMOS VCO.
In this technology, the PMOS devices have lower $1/f$ noise than
the NMOS devices. Our experiment also compares the relative
contributions to $1/f^3$ phase noise between the bias transistors
and the switching transistors. The 60 µm/0.18 µm NMOS VCO
was designed with an NMOS current source, and with NPN and
resistor current sources that have virtually no flicker noise. At
a 10-kHz offset, the phase noise of the three VCOs is similar,
indicating that the dominant contributor to $1/f^3$ phase noise is
the $1/f$ noise of the switching transistors.

TABLE III
EXPERIMENTAL RESULTS: MEASURED PHASE NOISE IN $1/f^3$ REGION

Switching Device   Bias       Freq. (MHz)   dBc/Hz (10 kHz)
NMOS 60/0.18       NPN
NMOS 20/0.18       NPN
NMOS 60/0.6        NPN
NMOS 60/0.18       NMOS
NMOS 60/0.18       Resistor
PMOS 200/0.18      PMOS
PMOS 60/0.18       PMOS
PMOS 30/0.18       PMOS

The importance of the $1/f$ noise upconversion factor is
evident when comparing the 60 µm/0.6 µm NMOS and
60 µm/0.18 µm PMOS VCOs to the 20 µm/0.18 µm NMOS
design. Although the 0.6-µm-length NMOS device should have
less flicker noise than the 20 µm/0.18 µm device, it has 5 dB
worse phase noise at a 10-kHz offset. Likewise, the models
indicate that the PMOS devices have less flicker noise than
the NMOS devices. Nevertheless, the 60 µm/0.18 µm PMOS and
20 µm/0.18 µm NMOS VCOs have the same phase noise at
a 10-kHz offset. The higher $f_t$ of the 20 µm/0.18 µm NMOS
device reduces AM-PM conversion and lowers the $1/f$
upconversion factor.


VI. OPTIMIZED ALL-PMOS VCO TOPOLOGY 


Analysis of phase noise mechanisms has shown why PMOS
switching devices provide better phase noise performance than
NMOS devices in both the $1/f^3$ and $1/f^2$ regions in a 0.18-µm
$L_{min}$ technology. Fig. 11 shows the measured phase noise for
the optimally sized 30 µm/0.18 µm PMOS design. Phase noise
of -124 dBc/Hz is achieved at a 1-MHz offset and a center
frequency of 5.32 GHz. The design operates from a 1.8-V supply
and consumes 7.5 mA of bias current. For a tuning voltage range
from 0 to 1.8 V, the VCO tunes 400 MHz, or approximately 8%.
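
The headline figures in this paragraph can be cross-checked with a few lines of arithmetic. The helper below is a hypothetical convenience function, not part of any tool used in this work; it recovers the 13.5-mW dissipation quoted in the abstract, the fractional tuning range, and the average varactor sensitivity from the numbers stated above.

```python
def tuning_summary(supply_v, bias_a, tune_span_hz, center_hz, v_tune_span):
    """Return (power in mW, tuning range in %, average sensitivity in MHz/V)."""
    power_mw = supply_v * bias_a * 1e3
    tuning_pct = 100.0 * tune_span_hz / center_hz
    sens_mhz_per_v = tune_span_hz / v_tune_span / 1e6
    return power_mw, tuning_pct, sens_mhz_per_v


power_mw, tuning_pct, sens = tuning_summary(
    supply_v=1.8, bias_a=7.5e-3, tune_span_hz=400e6, center_hz=5.32e9, v_tune_span=1.8
)
print(f"power = {power_mw:.1f} mW, tuning = {tuning_pct:.1f} %, "
      f"sensitivity = {sens:.0f} MHz/V")
```

The resulting sensitivity of roughly 220 MHz/V is consistent with the "approximately 200 MHz/V" average quoted in Section III.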

The all-PMOS VCO circuit also offers several advantages
from a topology standpoint. It provides excellent isolation from
power supply noise through the use of a PMOS current source
to VDD, and a ground-referenced tank. Fig. 12 illustrates the
effect of power supply noise on the all-PMOS topology in contrast
to an NMOS topology. The NMOS circuit allows supply noise
to couple into the oscillator feedback loop. In addition,
low-frequency supply noise directly modulates the voltage across the
varactor in the NMOS case, inducing phase noise through
frequency modulation. A supply noise rejection simulation was run
using the PXF analysis in SpectreRF. The periodic transfer function
for low-frequency noise from VDD to the VCO output nodes is
20 dB less in the PMOS topology than in the NMOS topology.
Finally, if one uses p+/n- junction varactor diodes, the PMOS
tank Q will be higher, as discussed in Section IV.

Fig. 12. Effect of VDD noise on VCO topologies.
Fig. 13. Effect of $I_{bias}$ noise on varactor dc bias.

The all-PMOS topology can scale down to lower supply voltages
than the double cross-coupled topology, which uses NMOS
and PMOS differential pairs and requires an additional $V_{gs}$ of
voltage headroom. The extra $V_{gs}$ makes it difficult to bias the
switching devices at a high gate overdrive for optimized phase
noise. Fig. 13 compares the effect of bias current noise on the dc
level of the varactors in the two topologies. Noise fluctuations
on the bias current modulate the $V_{gs}$ of the bottom NMOS
transistors in the double cross-coupled topology. The dc bias on the
varactors varies, resulting in modulation of the varactor
capacitance. In the all-PMOS topology, bias current noise can
modulate the $V_{gs}$ of the PMOS switching pair but does not change
the dc bias point of the varactors, which are referenced to ground
through a low dc impedance.

The ground-referenced tank serves to minimize noise
disturbances to the varactor, allowing the achievement of higher
values of $K_{VCO}$ without degradation in phase noise. In summary,
the all-PMOS VCO topology is desirable because it minimizes
both intrinsic and extrinsic sources of phase noise.

Fig. 14. Comparison of normalized figure of merit versus $f_o$.


VII. COMPARISON OF RESULTS 


In comparing our phase noise results to other published re- 
sults, we normalize all phase noise data to a center frequency 
of 5.4 GHz and a frequency offset of 1 MHz. Fig. 14 graphs 
the normalized phase noise figure of merit (FOM) of this work 
and other published data against center frequency $f_o$ [3]-[6],
[8]-[11], [17]-[27]. The graph shows a general upward trend,
with the normalized FOM degrading as the oscillation frequency
increases. Our work lies at the bottom right corner of the graph,
demonstrating excellent phase noise at a high oscillation fre- 
quency. Of the results with better phase noise performance, three 
use high Q external inductors while another one operates at a 
much lower center frequency of 1.2 GHz. In light of this paper’s 
analysis on AM-PM conversion in the switching transistors, we 
postulate that the degradation of phase noise performance at 
higher oscillation frequencies seen in this graph is due to an 
increase in low-frequency bias noise upconversion, which our 
design has specifically minimized. 

We emphasize that our results are achieved without adding 
additional on-chip LC bias filters or external decoupling capac- 
itors, maintaining the use of a standard current mirror and cur- 
rent source. Finally, although biasing at 7.5 mA optimizes our 
phase noise, it does not necessarily optimize our FOM. We can 
lower our power consumption by scaling down the bias current 
to 1 mA. At this bias condition, the measured phase noise at a 
1-MHz offset is -118 dBc/Hz, corresponding to an optimized
FOM of -190 dBc/Hz/mW.
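
The quoted figure can be checked with a short calculation. The sketch below assumes the commonly used oscillator figure of merit, FOM = L(Δf) - 20 log10(f_o/Δf) + 10 log10(P_dc / 1 mW), together with the 5.3-GHz center frequency and 1.8-V supply given in the abstract; the function name `vco_fom` is illustrative and not taken from the paper.

```python
import math

def vco_fom(phase_noise_dbc_hz, f_center_hz, f_offset_hz, p_dc_mw):
    """Oscillator figure of merit in dB (assumed standard definition)."""
    return (phase_noise_dbc_hz
            - 20.0 * math.log10(f_center_hz / f_offset_hz)
            + 10.0 * math.log10(p_dc_mw))

# 1 mA drawn from a 1.8-V supply gives 1.8 mW of dc power.
p_dc_mw = 1.0 * 1.8
fom = vco_fom(-118.0, 5.3e9, 1.0e6, p_dc_mw)
print(f"FOM = {fom:.1f} dB")
```

Under these assumptions the result lands within a fraction of a decibel of the -190 figure quoted above.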


VIII. CONCLUSION 


Several new concepts are proposed for the optimization of
phase noise. Switching transistor device width should be minimized
to lower the upconversion factor for low-frequency bias
noise and switching transistor flicker noise. An important benefit
of this is that bias noise contributions to phase noise are
minimized without needing to add filters or remove the current
source. The fact that $1/f^3$ phase noise is improved through a
reduction in the size of the switching transistors is counterintuitive
and highlights the importance of the upconversion factor.
PMOS switching transistors should be used instead of NMOS
switching transistors because, under optimal bias conditions,
they contribute less drain current thermal noise for the same $g_m$.
This results in improved phase noise performance in the $1/f^2$
region. An all-PMOS VCO topology using a ground-referenced
tank is effective in reducing the influence of secondary noise 
mechanisms such as the upconversion of bias noise through the 
varactor, and the upconversion of low-frequency supply noise. 

Key dependencies between phase noise and device parame- 
ters are derived from theoretical analysis. These relationships 
are confirmed in both simulations and measured phase noise results
taken from a systematic experiment consisting of eight VCO
designs. Proper device choice and device sizing are essential to 
the optimization of phase noise. 
Analysis and Simulation of Spectral Regrowth
in Radio Frequency Power Amplifiers 


Burcin Baytekin, Student Member, IEEE, and Robert G. Meyer, Fellow, IEEE 


Abstract—This paper presents a novel method for efficiently an- 
alyzing the relationship between spectral regrowth and physical 
distortion mechanisms in radio frequency power amplifiers. It uti- 
lizes a Volterra series model whose coefficients are computed from 
basic SPICE parameters. The analysis uses a decomposition of the 
Volterra kernels into simpler subsystems in order to greatly re- 
duce the computation times. The method is applied to the design 
of several bipolar-transistor power amplifiers after a series-based 
model is developed for representing the increase in active device 
forward transit time at high collector current densities. A number 
of single-stage SiGe power amplifiers have been designed, fabri- 
cated, and tested using the IEEE802.11b and IS-95 modulation 
schemes at different carrier frequencies, and these results are com- 
pared with the theoretical analysis. 


Index Terms—Adjacent channel power ratio (ACPR), distortion, 
forward transit time, power amplifiers, spectral regrowth, Volterra 
series. 


I. INTRODUCTION 


LINEAR power amplifiers (PAs) are becoming widely used
in wireless communication systems with the rising popularity
of code-division multiple access (CDMA) and orthogonal
frequency-division multiplex (OFDM) systems. The envelope
of the signals in these communications systems is not constant,
so that the PA design for these systems must pay attention to the 
sources of nonlinearity in the PA in order to limit the amount of 
spectral regrowth, which can cause unacceptable levels of inter- 
ference in the adjacent channels. 

A power amplifier has to supply all of the radiated power at 
the transmitter antenna, as well as the power lost through the 
passive elements such as radio frequency (RF) filters or du- 
plexers. This makes the PA efficiency the dominant factor in the 
total power dissipation of the radio transmitter, which is espe- 
cially significant for mobile communication applications. 

The trade-off between efficiency and linearity leads PA 
designers to search for an optimum device operating point. 
Therefore, a good understanding of the effects of the transistor 
components and PA design parameters on linearity is essential. 
Linearity has long been analyzed by using intermodulation 
distortion (IM3) as a metric, instead of the transmit spectrum
mask, adjacent channel power ratio (ACPR), or error vector 
magnitude (EVM) specifications used in the wireless standards. 
The phenomenon of spectral regrowth has been analyzed in 
recent publications, but these treatments employ empirical 
methods which require curve-fitting a function (such as a real 
or complex baseband-equivalent power series) to the AM-AM 
and AM-PM simulations or measurements [1]-[7]. This ap- 
proach does not provide much insight into the relationship 
between the physical mechanisms in the circuit and spectral 
regrowth. Some of these analyses also assume that the input 
signals have a Gaussian amplitude distribution, although all 
of the conventional digital communications systems are based 
on the transmission of a set of discrete symbols with equal 
probability. 

As the carrier frequency in wireless systems can be 2 to 4 
orders of magnitude greater than the bandwidth of the signal, 
simulating the performance of PAs with properly modulated sig- 
nals is impractical with the conventional time-domain methods. 
This leads designers to use rule-of-thumb methods involving
single- or two-tone test simulations, although there is no simple
relationship between the results of these tests and ACPR-type
specifications.

In this paper, a novel method is proposed to predict spec- 
tral regrowth in PAs. The method uses basic SPICE parameters, 
which are based on the active device physical mechanisms, in 
order to make it applicable to transistors fabricated by different 
processes. First, a Volterra series model of the PA is calculated 
from the SPICE parameters, and the spectral regrowth is then 
predicted by using modulated signals. The decomposition of the 
Volterra kernels into simpler subsystems as proposed in Sec- 
tion III allows the combination of frequency and time-domain 
computations so that numerical results can be rapidly calculated. 
These results can assist circuit designers in understanding the ef- 
fect of design parameters on spectral regrowth and the trade-off 
between efficiency and linearity. A better understanding of these 
issues allows them to determine their initial design parame- 
ters more accurately before they initiate detailed simulations, 
helping them avoid time-consuming iterations. Identifying the 
transistor components contributing to distortion helps device de- 
signers optimize the power transistors. 

Section II of this paper presents a brief overview of the 
Volterra series and Section III explains how to obtain numerical 
results from the analysis. Section IV presents an implementa- 
tion example and Section V provides the results obtained, while 
Section VI concludes the paper. 


0018-9200/$20.00 © 2005 IEEE 
II. VOLTERRA SERIES


If a nonlinear system does not have memory, the output can 
be expressed as a Taylor series 


$$ y(x) = a_1 x + a_2 x^2 + a_3 x^3 \tag{1} $$


which models weakly nonlinear behavior reasonably well. How- 
ever, RF power amplifiers include circuit elements, such as ca- 
pacitors and inductors, whose impedances vary with frequency. 
This variation introduces memory into the system, which may 
be modeled by means of a Volterra series, as shown in (2) at the 
bottom of the page, where the functions $h_n(\tau_1, \tau_2, \ldots, \tau_n)$ are
the Volterra kernels of the system [8]. 

If a causal system described by a Volterra series lacks 
memory, then 


$$ h_n(\tau_1, \tau_2, \ldots, \tau_n) = a_n\, \delta(\tau_1)\, \delta(\tau_2) \cdots \delta(\tau_n) \tag{3} $$


and (2) reduces to a Taylor series. 
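
For reference, a Volterra series truncated at third order takes the following standard form. It is assumed, not confirmed by the text available here, that (2) has this form; the symbols follow the kernels $h_n$ and input $x(t)$ used above.

```latex
\begin{aligned}
y(t) ={} & \int_{-\infty}^{\infty} h_1(\tau_1)\, x(t-\tau_1)\, d\tau_1 \\
 & + \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}
     h_2(\tau_1,\tau_2)\, x(t-\tau_1)\, x(t-\tau_2)\, d\tau_1\, d\tau_2 \\
 & + \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty}
     h_3(\tau_1,\tau_2,\tau_3)\, x(t-\tau_1)\, x(t-\tau_2)\, x(t-\tau_3)\,
     d\tau_1\, d\tau_2\, d\tau_3 + \cdots
\end{aligned}
```

Substituting a memoryless kernel $h_n(\tau_1,\ldots,\tau_n) = a_n\,\delta(\tau_1)\cdots\delta(\tau_n)$ collapses each integral through the sifting property of the delta function, leaving $y(t) = a_1 x(t) + a_2 x^2(t) + a_3 x^3(t) + \cdots$, which is the Taylor series in (1).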

In narrowband systems, the distortion products due to even- 
order kernels fall into frequency bands well removed from the 
desired signal as shown in the Appendix [9]. Hence, even-order 
Volterra kernels can be neglected for the analysis of spectral re- 
growth. Volterra kernels in (2) beyond the third are usually ne- 
glected in order to make a practical evaluation of y(t) possible. 
Although inclusion of the higher-order terms would result in a
more accurate representation, the accuracy of these kernels depends
on the accuracy of the derivatives of the nonlinear functions
in the circuit, which are difficult to determine precisely.
In practice, useful results are obtained neglecting terms beyond 
third order and this approach is followed here. 


III. METHOD OF COMPUTATION 
A. Time and Frequency-Domain Calculations 


The input x(t) used in a typical communication system does 
not take on deterministic values, but is composed of a signal 
modulated by random data and a specified modulation scheme, 
requiring the generation of a large number of bits for a proper simulation.
Furthermore, the carrier frequency in wireless communication
systems can be 2 to 4 orders of magnitude larger than
the bandwidth of the information-carrying signal or the envelope.
Therefore, computation in the time domain requires too
many samples per bit of data for practical circuit simulations. 
Frequency-domain calculations require far fewer samples, be- 
cause the computation can be limited to the frequency bands of 
interest in narrowband systems. Furthermore, taking the Fourier 
transform of the convolutions in the Volterra series reduces the 
order of the computation. For example, the first convolution integral
in (2), representing the linear portion of the system, is
an $O(N^2)$ computation, while its Fourier transform results in a
simple multiplication, which is $O(N)$, as shown in (4).


$$ y_1(t) = \int_{-\infty}^{\infty} h_1(\tau)\, x(t - \tau)\, d\tau \tag{4a} $$
$$ Y_1(f) = H_1(f)\, X(f). \tag{4b} $$
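
The equivalence in (4) can be illustrated numerically. The following is a minimal sketch using NumPy with a made-up exponential kernel and a random test input, neither of which comes from the paper; it compares a direct circular convolution with the product of the two spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N)           # arbitrary test input
h1 = np.exp(-np.arange(N) / 8.0)     # hypothetical first-order kernel

# Time domain, as in (4a): direct circular convolution, O(N^2) operations.
y_time = np.array([sum(h1[k] * x[(n - k) % N] for k in range(N))
                   for n in range(N)])

# Frequency domain, as in (4b): one multiplication per frequency bin.
# The FFTs that move between domains cost O(N log N).
y_freq = np.fft.ifft(np.fft.fft(h1) * np.fft.fft(x)).real

print(np.allclose(y_time, y_freq))
```

The discrete version is a circular convolution, so the comparison holds exactly only for periodic records; for a finite-duration signal, zero-padding would be needed.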


The three-dimensional Fourier transform of the third-order 
Volterra operator enables a similar frequency-domain computation
[10], shown in (5a)-(5b) at the bottom of the page.

Equation (5b) can be used for intermodulation distortion 
(IM3) calculations, which involve only two tones. In this case, 
$X(f)$ is nonzero at only four frequency values, two on the
positive axis ($f_1$ and $f_2$) and two on the negative axis ($-f_2$
and $-f_1$). Therefore, the double integral in (5b) reduces to a
few multiplications and additions. This has allowed designers 
in the past to perform hand calculations based on the Volterra 
series [11]. There are well-known procedures, such as the 
Bussgang method, allowing the computation of a symmetric 
$H_3(\omega_1, \omega_2, \omega_3)$ based on the SPICE parameters [12]. This is
the method utilized for calculating the Volterra kernels in this 
treatment. 
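
To illustrate why the two-tone case is tractable by hand, the sketch below replaces the third-order kernel with a constant coefficient `a3`, which is a memoryless simplification and not the paper's model. In that special case the IM3 product at $2f_1 - f_2$ has amplitude $\tfrac{3}{4}|a_3|A^3$. All numeric values are hypothetical.

```python
import numpy as np

N = 4096                        # samples in the analysis window
k1, k2 = 200, 210               # tone frequencies expressed as FFT bins
A, a1, a3 = 0.1, 10.0, -50.0    # tone amplitude and assumed coefficients

n = np.arange(N)
x = A * (np.cos(2 * np.pi * k1 * n / N) + np.cos(2 * np.pi * k2 * n / N))
y = a1 * x + a3 * x**3          # memoryless stand-in for the third-order path

Y = np.fft.rfft(y) / (N / 2)    # scale so |Y[k]| equals the cosine amplitude
im3_sim = abs(Y[2 * k1 - k2])   # lower IM3 product at 2*f1 - f2
im3_hand = 0.75 * abs(a3) * A**3

print(im3_sim, im3_hand)
```

With a frequency-dependent kernel the factor $a_3$ would be replaced by $H_3$ evaluated at the corresponding frequency triple, but the number of terms stays just as small.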

As explained above, $X(f)$ is no longer a deterministic signal
when an analysis of the spectral regrowth using modulated
signals is desired. Even though frequency-domain computation
reduces the required number of samples and the order
of the computation compared to the time-domain approach,
implementing (5b) directly would still result in an $O(N^3)$
algorithm. The resulting computation times are too long to
be practical, which sometimes leads researchers to assume the
Volterra kernels to be constant, reducing the Volterra series to
a Taylor series expansion with complex coefficients [13]. However,
this ignores memory effects arising from second-order
interaction terms which represent the mixing of linear signals 
with second-order distortion products if there is feedback or 
nonlinearities are cascaded in the circuit. 


B. Decomposition of the Volterra Kernels 


In the past, the long computation times have prevented the 
Volterra series from being utilized in a distortion analysis 
involving more than a few input tones. In order to dramati- 
cally reduce the amount of time it takes for the computations 
to be completed, a decomposition of the Volterra kernels is 
proposed in this treatment. A closer examination shows that 
$H_3(\omega_1, \omega_2, \omega_3)$ for circuits can be broken down into parallel
combinations of models resembling the ones shown in Figs. 1
and 2 without using any approximations [9]. The former figure
shows the pure third-order subsystem, while the latter
represents the second-order interaction. This decomposition is
still an exact representation of the original Volterra system, as
shown in the Appendix, because the only source of memory in
the PA circuit is the frequency dependence of the impedance
of inductors or capacitors (whether their values are constant or
voltage-dependent).

This decomposition allows the representation of the 
third-order nonlinear system with memory by a combina- 
tion of some linear blocks with memory and nonlinear blocks 
without memory. The computations involving the linear blocks 
represented by filters can easily be done in the frequency-do- 
main according to (4b). As the nonlinear blocks lack memory, 
cubing, squaring or multiplication of the signals can be done 
in the time-domain at a simulation carrier frequency $f_s$ much
lower than the actual carrier frequency $f_c$. This allows the
compression of the spectrum as shown in Fig. 3. The com- 
bination of time and frequency-domain calculations allows 
the response of the circuit to be represented by a closed-form 
solution, instead of requiring numerical solutions of differential 
equations described in [14]. 

The compressed spectrum requires about 8 times more 
samples than the baseband equivalent version, but the order 
of the computation is reduced to $O(N \log N)$, limited by the
inverse-FFT and FFT calculations required before and after 
the nonlinear blocks. Therefore, all of the computations can 
be completed in a very short time frame using a tool such as
Matlab [15]. It is possible to reduce the processing time further 
by using a more direct programming language such as C if ease 
of implementation is not sought. 
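
The mixed-domain procedure can be sketched as follows. The branch structure used here (linear filter, memoryless cube, linear filter) is an assumption about what a single third-order branch looks like, since Figs. 1 and 2 are not reproduced in this text; the one-pole responses `H_in` and `H_out` and all function names are hypothetical.

```python
import numpy as np

def linear_block(x, H):
    """Apply a linear block with memory in the frequency domain, as in (4b)."""
    return np.fft.ifft(H * np.fft.fft(x)).real

def third_order_branch(x, H_in, H_out):
    """Filter, memoryless cube, filter (assumed branch structure)."""
    u = linear_block(x, H_in)      # frequency-domain multiplication
    v = u**3                       # time-domain cubing, no memory involved
    return linear_block(v, H_out)  # back through a linear block

N = 1024
f = np.fft.fftfreq(N)                      # normalized frequency, cycles/sample
H_in = 1.0 / (1.0 + 1j * f / 0.05)         # hypothetical one-pole responses
H_out = 1.0 / (1.0 + 1j * f / 0.10)
n = np.arange(N)
x = 0.5 * np.cos(2 * np.pi * 32 * n / N)   # tone at a low simulation carrier

y3 = third_order_branch(x, H_in, H_out)
print(y3[:4])
```

Each branch costs two FFT/inverse-FFT pairs plus an elementwise power, so the total work grows as $N \log N$ rather than with the cube of the record length.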


C. Device Modeling 


The new computational approach described above was ap- 
plied to the design of PAs using bipolar transistors as the ac- 
tive element. The nonlinear bipolar transistor model used in the 
analysis is shown in Fig. 4. This model is applicable to Si and 
SiGe BJTs, as well as GaAs HBTs. The output impedance seen 
by a PA is generally small compared to $r_o = V_A/I_C$, where
$V_A$ is the Early voltage, so $r_o$ is neglected in the model. The
parasitic capacitance between collector and substrate can be as- 
sumed small in a silicon-on-insulator (SOI) process or can be 
assumed constant and combined with the package parasitics and 
output matching network in more conventional processes. 

The parasitic collector resistance $r_c$ and emitter resistance
$r_e$ are assumed constant. Although the value of $r_b$ is known
to change at high base-current levels, its nonlinearity is usually
negligible compared to the other nonlinear elements in the
transistor [12], so $r_b$ is assumed constant as well. The transconductance
$g_m = I_C/V_T$ is nonlinear due to the exponential relationship
between $I_C$ and $V_{BE}$. The base-emitter resistance
$r_\pi = \beta/g_m$ is also nonlinear, because of $g_m$. The variation of $\beta$ due
to changes in the base current is assumed negligible compared
to the exponential behavior of $g_m$, so $\beta$ is assumed constant.
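
As a rough illustration of where the series coefficients for $g_m$ come from, the sketch below expands the ideal exponential law $I_C = I_S e^{V_{BE}/V_T}$ about the bias point. The bias current, $\beta$, thermal voltage, and the names `K_GM2` and `K_GM3` are assumptions chosen for the example, not values from the paper.

```python
import math

V_T = 0.02585        # thermal voltage near 300 K, in volts (assumed)
I_C_BIAS = 2.5e-3    # hypothetical per-device bias current, in amperes
BETA = 100.0         # hypothetical current gain

# Taylor coefficients of i_c(v_be) for the ideal exponential law.
G_M = I_C_BIAS / V_T                  # first order: transconductance
K_GM2 = I_C_BIAS / (2.0 * V_T**2)     # second order
K_GM3 = I_C_BIAS / (6.0 * V_T**3)     # third order
R_PI = BETA / G_M                     # base-emitter resistance

def ic_series(v_be):
    """Third-order series approximation of the signal current."""
    return G_M * v_be + K_GM2 * v_be**2 + K_GM3 * v_be**3

def ic_exact(v_be):
    """Exact signal current from the exponential law."""
    return I_C_BIAS * (math.exp(v_be / V_T) - 1.0)

v = 5e-3  # small-signal excursion, in volts
print(ic_series(v), ic_exact(v), R_PI)
```

The third-order series tracks the exponential closely for excursions well below $V_T$ and departs from it as the swing grows, which is one reason the weakly nonlinear assumption matters.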

Linear PA design requires that the power transistors are pre- 
vented from going into the saturation region. Therefore, it is 
assumed that the collector-base junction is never forward biased
for the analysis. Thus, the collector-base capacitor $C_\mu$ in
Fig. 4 is composed of some constant parasitic capacitance and
the collector-base junction depletion capacitance. The latter capacitance
is the cause of nonlinearity in $C_\mu$, and the analysis
shows that its contribution to distortion is considerable due to 
the large signal swings across the collector-base junction during 
the PA operation. 

The base-emitter capacitance $C_\pi$ in Fig. 4 is given by


$$ C_\pi = C_b + C_{je} = g_m \tau_F + C_{je} \tag{6} $$

where $C_{je}$ is the base-emitter depletion capacitance, $C_b$ is the
diffusion capacitance and $\tau_F$ is the forward transit time. Although
$C_{je}$ is usually neglected or assumed constant, its value
can become comparable to $C_b$ and its nonlinearity can be significant
when the $V_{BE}$ swing is large. Therefore, it is necessary to
model the nonlinearity of $C_{je}$ for an accurate analysis.


D. Modeling the Variations in Forward Transit Time 


The diffusion capacitance $C_b$ also varies with $V_{BE}$ because
of the variation in the transconductance $g_m$ and forward transit
time $\tau_F$. The latter is usually assumed constant, but its value
starts to increase and cause additional distortion at high collector
current densities. In order to increase the accuracy of the analysis,
the series-based model outlined below has been developed
by the authors to take this variation into account.




Fig. 4. Nonlinear bipolar transistor model. 


The forward transit time $\tau_F$ has four components


$$ \tau_F = \tau_E + \tau_{BE} + \tau_B + \tau_{BC} \tag{7} $$


where $\tau_E$ is the emitter transit time, $\tau_{BE}$ is the base-emitter
depletion region transit time, $\tau_B$ is the base transit time, and
$\tau_{BC}$ is the base-collector depletion region transit time [16].

The first component of $\tau_F$ affected by the increasing current
density is usually $\tau_{BC}$. When current is flowing through
an npn transistor, the injected electrons are added to the negatively
charged depletion region on the base side and subtracted
from the positively charged depletion region on the collector
side [17]. For a constant base-collector voltage, this requires
the depletion region of the collector to be wider and $\tau_{BC}$ to be
larger. The depletion region of the base becomes shorter as well,
but the effect is much less pronounced as the doping density in
the base is several orders of magnitude higher than in the collector.
If the current density is increased further, other mechanisms,
such as base widening or the Kirk effect, will be observed
[16], [18], [19]. However, at this point, the rise in $\tau_F$ is quite
rapid and this operating region is not suitable for a linear PA.
Therefore, only the variation in $\tau_{BC}$ is modeled for this work.

If the same basic assumptions outlined in [17] are followed,
$\tau_{BC}$ can be calculated as


[Equation (8): expression for $\tau_{BC}$; not recoverable from the extracted text.]


where $q$ is the electron charge, $x_{d0}$ is the width of the depletion
region without current, $N_c$ is the collector dopant density, $J_c$ is
the collector current density and $v_{sat}$ is the scattering-limited
velocity of carriers.

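In this case, $\tau_F$ can be expanded as a Taylor series


$$ \begin{aligned} \tau_F(V_{BE} + v_{be}) &= \tau_F(V_{BE}) + \tau_{BC}'(V_{BE})\, v_{be} + \frac{\tau_{BC}''(V_{BE})}{2}\, v_{be}^2 + \frac{\tau_{BC}'''(V_{BE})}{6}\, v_{be}^3 + \cdots \\ &= \tau_F(V_{BE}) + K_{\tau_F 1}\, v_{be} + K_{\tau_F 2}\, v_{be}^2 + K_{\tau_F 3}\, v_{be}^3 + \cdots \end{aligned} \tag{9} $$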


Thus, the current in $C_b$ is given by


$$ i_{C_b} = \frac{dQ_b}{dt} = \tau_F(V_{BE})\, \frac{di_c}{dt} + I_C\, \frac{d\tau_F}{dt} \tag{10} $$
Fig. 6. Die photo with the power transistor in the center and solder bumps around it.

The first part of (10) is the usual equation which governs the
Volterra series representing the nonlinearity of the diffusion capacitance
when $\tau_F$ is assumed constant. The second part is the
series proposed for modeling the variations of $\tau_{BC}$. Equation
(10) can be further simplified as





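$$ i_{C_b} = \left(g_m \tau_F + I_C K_{\tau_F 1}\right) \frac{dv_{be}}{dt} + \left(K_{g_m 2}\, \tau_F + I_C K_{\tau_F 2}\right) \frac{dv_{be}^2}{dt} + \left(K_{g_m 3}\, \tau_F + I_C K_{\tau_F 3}\right) \frac{dv_{be}^3}{dt} + \cdots \tag{11} $$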


so that the $\tau_F$ variation model does not increase the computational
complexity, once the extra $K_{\tau_F n}$ coefficients are calculated.


IV. IMPLEMENTATION 


In order to compare the results of the analysis with measure- 
ments, a number of single-stage PAs have been designed for 
the IEEE802.11b wireless LAN standard operating at 2.4 GHz 
in the ISM band. The standard specifies the maximum output 
power level to be 20 dBm at the antenna, but the PAs have been 
designed to supply 24 dBm in order to accommodate the losses 
through the passive elements before the signal reaches the an- 
tenna. The transmit spectrum mask specifications require the 
spectral products in the adjacent sidelobe to be 30 dB below 
the main lobe, as shown in Fig. 5. 

The PAs have been designed using SiGe bipolar transistors 
and flip-chip packaging. Measurements have been taken using 
different values for input and output matching elements, bias
current, supply voltage, as well as different numbers of transistors,
which can be changed by means of a laser cutter. The die
photo is shown in Fig. 6.
Fig. 7. Simplified schematic of the power amplifier.


A. Details of the Circuit 


The simplified schematic of the common emitter PA is shown 
in Fig. 7. In order to reduce the number of nodes, the Norton 
equivalent of the signal source, input matching and local-bias 
circuit, shown in the dashed box, has been used in the analysis. 

The off-chip choke between the collector and $V_{CC}$ passes
the bias current, but has high impedance at radio frequencies so
that almost all of the signal flows into the antenna through the
output matching network. $R_E$ includes the parasitic emitter resistance,
as well as the emitter degeneration resistance. The total
value of $R_E$ is adjusted so that there is a dc voltage drop of
about 50 mV across it, in order to prevent thermal runaway. $L_E$ is composed
of the on-chip wiring inductance, package and board parasitics.
The on-chip part is estimated by using Greenhouse formulas
[20] and the off-chip parts are calculated based on a 3-D
EM simulator. Both $R_E$ and $L_E$ improve linearity through series
feedback, but an inductor is preferred, as it does not limit
voltage headroom as a resistor does. Inductive degeneration also
increases the real part of the input impedance so that the input
(or interstage) matching network can be designed with a lower
quality factor (Q).

The series feedback used to improve the linearity of the PA 
usually causes the third-order coefficient of the system to have 
a sign opposite to that of the first-order one. Thus, gain compression
occurs at high power levels. One way to alleviate the gain com- 
pression problem is to allow the PA bias current to increase at 
high power levels through some modifications of the biasing cir- 
cuitry. The conventional biasing circuitry for a common emitter 
amplifier is usually based on a current mirror with a current 
helper as shown in Fig. 8 [21]. The ratio of the resistors tied 
to the base and emitter of $Q_1$ and $Q_2$ must be adjusted to make
sure the voltage drop across them and, in turn, the voltage drop
across the base-emitter junctions are the same. The base resistor
and capacitor $C_B$ act as a filter to attenuate the input signal before
it reaches the bias circuit.

A large input signal swing increases the average value of the
collector current over the quiescent value, due to the exponential
nature of the bipolar transistor. However, this increases the base
current and causes a bigger voltage drop across the base resistor,
reducing the base-emitter voltage of the power transistor $Q_1$ and
limiting the increase in the average collector current. Therefore,
the biasing circuit shown in Fig. 9 is preferred, as the lack of
a resistor at the base of the power transistor allows a bigger
increase in average collector current at high power levels. A
further advantage is that the lower impedance seen by the base
of $Q_1$ increases the device breakdown voltage. The values of the
emitter and base resistors of $Q_2$ need to be adjusted so that the
voltage drops across the base-emitter junctions of both transistors
are equal to each other. The current mirror ratio depends on the
actual value of $\beta$, but dc simulations have shown that the change
in the collector current of $Q_1$ due to process and temperature
variations is still small. $C_B$ also needs to be moved toward the
base of $Q_2$, so that the signal is not attenuated before reaching
the power transistor.


Fig. 9. Local biasing circuit without base resistor.


Fig. 11. Input and output power spectral density calculated by simulations.

Analysis shows that the amount of output voltage swing at a 
given output power level is quite sensitive to the imaginary part 
of the impedance at the collector. A small positive imaginary 
part at that node can easily increase the voltage swing across 
base-collector junction and push the transistor into saturation, 
even if the output return loss $S_{22}$ stays acceptable. Therefore,
a transmission line, instead of a discrete inductor, is used in
the output matching network for more precise tuning. The typical
length of the transmission line required for a PA in a 50-Ω
system is usually short enough to be realized on compact circuit 
boards. 





IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 40, NO. 2, FEBRUARY 2005


[Plot: ratio of first sidelobe to mainlobe (dB) versus output power (dBm, 17 to 25), comparing measurements, analysis with constant $\tau_F$, and analysis with the $\tau_F$ model, for 802.11b modulation.]


V. RESULTS 


The simulation method using the Volterra-series-based 
power-amplifier model and the decomposition of the Volterra 
kernels has been implemented in Matlab. Baseband I and Q
signals are generated from oversampled 1024-bit-long random 
data streams filtered according to the specified modulation 
scheme. The baseband signals are then upconverted to the 
simulation carrier frequency to generate the input waveform, 
as shown in Fig. 10. The input signal amplitude is adjusted 
according to the desired input power level and fed into the 
PA model, which includes the SPICE coefficients of the npn 
transistors and the design parameters, to generate the output 
waveform [22]. The power spectral densities of the input and 
output waveforms shown in Fig. 11 are generated by this 
method. The amplitude of the adjacent sidelobes relative to the 
mainlobe is much larger in the output than in the input due to 
spectral regrowth. 
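
The signal-generation flow described above can be sketched in a few lines. This is an illustrative stand-in only: it uses plain QPSK with a windowed-sinc pulse instead of the 802.11b or IS-95 waveforms, and a memoryless cubic in place of the Volterra-based PA model. The filter, carrier frequency, nonlinearity coefficient, and function names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bits, osr = 1024, 16                   # random bits, samples per symbol
bits = rng.integers(0, 2, n_bits)
sym_i = 2.0 * bits[0::2] - 1.0           # QPSK mapping (stand-in modulation)
sym_q = 2.0 * bits[1::2] - 1.0

def shape(symbols, osr, span=8):
    """Upsample and apply a Hamming-windowed sinc pulse (illustrative)."""
    up = np.zeros(len(symbols) * osr)
    up[::osr] = symbols
    t = np.arange(-span * osr, span * osr + 1) / osr
    pulse = np.sinc(t) * np.hamming(len(t))
    return np.convolve(up, pulse, mode="same")

i_bb, q_bb = shape(sym_i, osr), shape(sym_q, osr)
n = np.arange(len(i_bb))
fc = 0.125                               # simulation carrier, cycles/sample
x = i_bb * np.cos(2 * np.pi * fc * n) - q_bb * np.sin(2 * np.pi * fc * n)
x /= np.max(np.abs(x))                   # normalize the peak to 1

y = x - 0.2 * x**3                       # hypothetical compressive nonlinearity

def sidelobe_ratio_db(sig, fc, osr):
    """Power just outside the channel relative to in-channel power, in dB."""
    psd = np.abs(np.fft.rfft(sig * np.hanning(len(sig))))**2
    f = np.fft.rfftfreq(len(sig))
    bw = 1.0 / osr                       # symbol rate, cycles/sample
    main = psd[np.abs(f - fc) < bw / 2].sum()
    adj = psd[(f - fc > 0.75 * bw) & (f - fc < 1.5 * bw)].sum()
    return 10.0 * np.log10(adj / main)

in_db = sidelobe_ratio_db(x, fc, osr)
out_db = sidelobe_ratio_db(y, fc, osr)
print(in_db, out_db)
```

The simulation carrier is chosen low enough that the third-harmonic zone of the cubic term stays below the Nyquist frequency and does not alias back onto the channel.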
The measured ratio of the adjacent sidelobe to the mainlobe
versus output power level is compared with numerical predictions
in Fig. 12. An IEEE 802.11b modulated waveform at
a carrier frequency of 2.45 GHz is applied to a power amplifier
which consists of 78 output transistors in parallel. The PA is operated
at a supply voltage of 3.3 V and a bias current of 196 mA.
At high power levels the current density becomes large, and
assuming a constant $\tau_F$ results in underestimation of
the spectral regrowth. The analysis including the $\tau_F$ model predicts the
measured sidelobe growth to within 1.6 dB,
while requiring only minimal computation time. The predicted
and measured gain with an IEEE 802.11b modulated waveform at
different power levels are shown in Fig. 13.

The results of the analysis show that the increase in the base-collector
depletion region transit time $\tau_{BC}$ at high collector current
densities can easily become the dominant source of nonlinearity
in a power amplifier. This problem can be alleviated
by increasing the total emitter area of the PA at the expense of
increased parasitics and lower gain. The analysis predicts that
using 104 parallel output transistors would reduce the contribution
of $\tau_{BC}$ variation to negligible levels and improve spectral
regrowth, which agrees with the measurements. The spectral
regrowth can also be improved by some modifications to the
power transistor, such as increasing the collector dopant den- 
sity or placing the highly doped buried layer closer to the col- 
lector-base junction, but attention must be paid by the device 
designers to make sure the device breakdown voltage does not 
become too low. 

A number of similar measurements have been taken from an- 
other PA with 78 parallel output transistors. The results of the 
measurements and analysis for this case can be seen in Fig. 14. 
The operating frequency in this case is 2 GHz, the quiescent cur- 
rent is 176 mA and the supply voltage is 3.5 V. The predicted 
spectral regrowth again differs by less than 1.5 dB compared 
to the measurements. Fig. 15 shows the results for the same PA 
supplying 24 dBm output at different average current levels. The 
measurements and predictions agree to within 0.6 dB. The trend 
difference at very high currents is due to the onset of the Kirk effect
and consequent modeling inaccuracies, which cause the measured
distortion to fall somewhat below the predicted value.

The agreement between the measurements and analysis is 
similar when the same PA is operated with IS-95 waveforms 
at the same carrier frequency as shown in Fig. 16. The expected 
ACPR differs by less than 2 dB compared to the measurements. 

This analysis has also been used to compare resistive and 
inductive degeneration. It shows that in addition to providing 
more headroom and higher input impedance, inductive degen- 
eration improves spectral regrowth considerably as well. This 
result agrees with a similar prediction for IM3 improvement 
[23]. Another linearity improvement method is using a low-frequency-trap
network [24]. However, the resistor used for preventing
thermal runaway makes the improvement in spectral regrowth
very small, in agreement with the analysis done
for the LNA in [24]. It should also be pointed out that even if 
resistive degeneration is not used, the improvement in spectral 
regrowth through the use of a low-frequency-trap network is reduced
due to $\tau_F$ variation.
VI. CONCLUSION


A novel method of analyzing spectral regrowth based on the 
Volterra series and basic SPICE parameters has been developed. 
The proposed decomposition of the Volterra kernels into simpler
subsystems has dramatically reduced the computation times. A
series-based model has also been developed to represent the increase
in the forward transit time of bipolar transistors at high
collector current densities. A number of single-stage SiGe power
amplifiers have been designed, fabricated and tested to validate
the analysis. The computations based on this method provide
good insight into the relationship between spectral regrowth and
the physical mechanisms in bipolar transistors. This can help 
circuit designers better understand the effect of design parame- 
ters on spectral regrowth, as well as the design trade-off between 
efficiency and linearity. In addition, it can help device designers 
optimize power transistors by better identifying the transistor 
components contributing most to distortion. 


APPENDIX I 
NEGLECTING EVEN-ORDER TERMS IN VOLTERRA SERIES 


The input x(t) to the system described by the Volterra series 
can be represented by an inverse Fourier transform 


$$ x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df. \tag{12} $$
The $n$th-order frequency-domain Volterra kernel of that
system can be expressed as the $n$-dimensional Fourier transform
of the time-domain kernel $h_n(\tau_1, \ldots, \tau_n)$


$$ H_n(f_1, \ldots, f_n) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_n(\tau_1, \ldots, \tau_n)\, e^{-j 2\pi (f_1 \tau_1 + \cdots + f_n \tau_n)}\, d\tau_1 \cdots d\tau_n. \tag{13} $$


The nth term of the Volterra series can then be rewritten as
(14), shown at the bottom of the page, if $x(t - \tau_i)$ is replaced
by (12) and $h_n(\tau_1, \ldots, \tau_n)$ by (13).

Variable $p_n$ can be replaced by defining $f = p_1 + \cdots + p_{n-1} + p_n$.
Please note that $df = dp_n$ and $p_n = f - \sum_{i=1}^{n-1} p_i$. Thus,
(14) becomes (15), also shown at the bottom of the page.
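
Equations (14), (15), and (17) are set at the bottom of the original page and are not reproduced in this text. The block below is a hedged reconstruction from the steps described in this appendix, not the authors' typeset form. It assumes $y_n(t)$ and $Y_n(f)$ denote the nth term of the series and its spectrum, and it renames the integration variables to $f_i$ in the last line.

```latex
\begin{align}
y_n(t) &= \int\!\cdots\!\int H_n(p_1,\ldots,p_n)\,
          \prod_{i=1}^{n} X(p_i)\;
          e^{j2\pi(p_1+\cdots+p_n)t}\, dp_1\cdots dp_n \tag{14}\\
y_n(t) &= \int \Bigg[ \int\!\cdots\!\int
          H_n\!\Big(p_1,\ldots,p_{n-1},\, f-\sum_{i=1}^{n-1}p_i\Big)
          \prod_{i=1}^{n-1}X(p_i)\;
          X\!\Big(f-\sum_{i=1}^{n-1}p_i\Big)\,
          dp_1\cdots dp_{n-1}\Bigg]\, e^{j2\pi f t}\, df \tag{15}\\
Y_n(f) &= \int\!\cdots\!\int
          H_n\!\Big(f_1,\ldots,f_{n-1},\, f-\sum_{i=1}^{n-1}f_i\Big)
          \prod_{i=1}^{n-1}X(f_i)\;
          X\!\Big(f-\sum_{i=1}^{n-1}f_i\Big)\,
          df_1\cdots df_{n-1} \tag{17}
\end{align}
```

Line (14) follows from substituting (12) for each delayed input and recognizing the remaining integral over the delays as (13). Line (17) is the bracketed factor of (15), read off as the spectrum of the nth term.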


If (15) is compared to the inverse Fourier transform equation

$$y_n(t) = \int_{-\infty}^{\infty} Y_n(f)\, e^{j 2\pi f t}\, df \tag{16}$$

the Fourier transform of the nth term of the Volterra series can
be calculated as (17), also shown at the bottom of the page.

Let us assume that a narrowband input signal is applied to
this nonlinear system, such that the carrier frequency $f_c$ is much
greater than the bandwidth of the signal $\Delta f$. Hence, $X(f_i)$ is
nonzero only for

$$f_i \in \left(-f_c - \frac{\Delta f}{2},\; -f_c + \frac{\Delta f}{2}\right) \cup \left(f_c - \frac{\Delta f}{2},\; f_c + \frac{\Delta f}{2}\right)$$

where $\Delta f \ll f_c$. Thus, the last term of the integral in (17) is
nonzero only if

$$f - \sum_{i=1}^{n-1} f_i \in \left(-f_c - \frac{\Delta f}{2},\; -f_c + \frac{\Delta f}{2}\right) \cup \left(f_c - \frac{\Delta f}{2},\; f_c + \frac{\Delta f}{2}\right). \tag{18}$$

Therefore, $Y_n(f)$ is nonzero around $f_c$ only if $\sum_{i=1}^{n-1} f_i$
is about zero. This requires $(n-1)/2$ of the $f_i$ terms to be in
$(-f_c - \Delta f/2,\; -f_c + \Delta f/2)$ and the remaining $(n-1)/2$
terms to be in $(f_c - \Delta f/2,\; f_c + \Delta f/2)$. In order for
$(n-1)/2$ to be an integer, n has to be odd.
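
The parity argument can be checked by brute force. The sketch below is illustrative only: it idealizes every one of the n frequencies as sitting exactly at +f_c or -f_c (the bandwidth is ignored), and the function name `count_in_band_combinations` is hypothetical. It counts the sign assignments whose sum lands back on +f_c.

```python
from itertools import product


def count_in_band_combinations(n: int) -> int:
    """Count sign patterns (s_1..s_n), s_i = +1 or -1, whose sum equals +1.

    Each s_i stands for a frequency at +f_c or -f_c, so a sum of +1 means
    the n-th order product falls back onto the carrier at +f_c.
    """
    return sum(1 for signs in product((+1, -1), repeat=n) if sum(signs) == 1)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, count_in_band_combinations(n))
```

Under this idealization, even orders produce no in-band combinations at all, while an odd order n needs (n+1)/2 positive and (n-1)/2 negative frequencies, which is the integer condition stated above.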





APPENDIX II 
THIRD-ORDER VOLTERRA SUBSYSTEMS 
A. Second-Order Interaction Subsystem 


Analyzing the second-order interaction subsystem shown in 
Fig. 2 is easier when the upper path is considered first. If the 
input and output of the squarer are called U(f) and V(f), re- 
spectively, the following relationships can be derived: 


$$U(f) = H_a(f)\, X(f) \tag{19}$$

and

$$V(f) = U(f) * U(f) = \int_{-\infty}^{\infty} U(p)\, U(f - p)\, dp \tag{20}$$

as a multiplication in the time domain is equivalent to a
convolution in the frequency domain. Applying this property once
again, the output of the multiplier $Z(f)$ can be calculated as
shown in (21) at the bottom of the page. As $Z(f)$ can also be
defined as

$$Z(f) = \int\!\!\int H_Z(\alpha,\, \beta - \alpha,\, f - \beta)\, X(\alpha)\, X(\beta - \alpha)\, X(f - \beta)\, d\alpha\, d\beta \tag{22}$$

a simple change of variables allows $H_Z(f_1, f_2, f_3)$ to be
expressed as

$$H_Z(f_1, f_2, f_3) = H_c(f_1 + f_2)\, H_b(f_3)\, H_a(f_1)\, H_a(f_2). \tag{23}$$
Unfortunately, this kernel is not symmetric. In order to get a
kernel which does not depend on the exact order of the variables
$(f_1, f_2, f_3)$, (21) needs to be rewritten as (24), also shown at the
bottom of the page. The variables $p$ and $F$ in the first double
integral of (24) can easily be replaced by $\alpha$ and $\beta$, respectively.
The variables $p$ and $f - F$ in the second one can then be replaced
by $\alpha$ and $\beta - \alpha$, respectively. Similarly, the variables $f - p$ and
$f - F$ in the last double integral can be replaced by $\beta$ and $\alpha$,
respectively, making $d\beta = -dp$ and $d\alpha = -dF$. After making
the appropriate changes in the limits of the integrals, (24) can
be represented by (25), shown at the top of the next page, and
(23) becomes

$$H_Z(f_1, f_2, f_3) = \frac{1}{3}\Big( H_c(f_1 + f_2)\, H_b(f_3)\, H_a(f_1)\, H_a(f_2) + H_c(f_1 + f_3)\, H_b(f_2)\, H_a(f_1)\, H_a(f_3) + H_c(f_2 + f_3)\, H_b(f_1)\, H_a(f_2)\, H_a(f_3) \Big). \tag{26}$$

As the output of the second-order interaction subsystem $Y(f)$
is

$$Y(f) = H_d(f)\, Z(f) \tag{27}$$

$H_3(f_1, f_2, f_3)$ of this subsystem can be calculated as

$$H_3(f_1, f_2, f_3) = \frac{H_d(f_1 + f_2 + f_3)}{3}\Big( H_c(f_1 + f_2)\, H_b(f_3)\, H_a(f_1)\, H_a(f_2) + H_c(f_1 + f_3)\, H_b(f_2)\, H_a(f_1)\, H_a(f_3) + H_c(f_2 + f_3)\, H_b(f_1)\, H_a(f_2)\, H_a(f_3) \Big). \tag{28}$$

B. Pure Third-Order Subsystem 


A similar but simpler analysis can be performed for the pure
third-order subsystem shown in Fig. 1. If the input and output
of the cuber are called $U(f)$ and $V(f)$, respectively, $V(f)$ can
be expressed as

$$V(f) = U(f) * U(f) * U(f) = \int\!\!\int U(\alpha)\, U(\beta - \alpha)\, U(f - \beta)\, d\alpha\, d\beta \tag{29}$$

by using the property of the frequency convolution. Thus, $Y(f)$
is given by (30), shown at the top of the page. As $Y_3(f)$ is
defined as

$$Y_3(f) = \int\!\!\int H_3(\alpha,\, \beta - \alpha,\, f - \beta)\, X(\alpha)\, X(\beta - \alpha)\, X(f - \beta)\, d\alpha\, d\beta \tag{31}$$

a simple change of variables allows $H_3(f_1, f_2, f_3)$ of this
subsystem to be represented by

$$H_3(f_1, f_2, f_3) = H_b(f_1 + f_2 + f_3)\, H_a(f_1)\, H_a(f_2)\, H_a(f_3). \tag{32}$$

When the third-order Volterra kernels encountered during an
analysis of a circuit [12] are compared to (28) and (32), it can
easily be shown that the original Volterra system can be
decomposed into parallel combinations of subsystems resembling
the ones shown in Figs. 1 and 2 without any approximations.
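
The kernels in (23), (28), and (32) can be evaluated numerically to confirm which ones are symmetric. The following is a minimal sketch, assuming the four linear blocks are named H_a to H_d as read from the equations above. The first-order low-pass responses and their corner frequencies are hypothetical stand-ins, not values from the paper.

```python
from itertools import permutations


def lowpass(corner_hz):
    """Hypothetical first-order low-pass response used as a placeholder block."""
    return lambda f: 1.0 / (1.0 + 1j * f / corner_hz)


# Placeholder linear blocks; the corner frequencies are arbitrary.
H_a, H_b, H_c, H_d = (lowpass(fc) for fc in (1e9, 2e9, 5e8, 3e9))


def h_z_unsymmetric(f1, f2, f3):
    """Kernel of (23): the third argument is singled out, so it is not symmetric."""
    return H_c(f1 + f2) * H_b(f3) * H_a(f1) * H_a(f2)


def h3_interaction(f1, f2, f3):
    """Kernel of (28): average of (23) over the three choices of the lone argument."""
    total = (h_z_unsymmetric(f1, f2, f3)
             + h_z_unsymmetric(f1, f3, f2)
             + h_z_unsymmetric(f2, f3, f1))
    return H_d(f1 + f2 + f3) * total / 3.0


def h3_pure(f1, f2, f3):
    """Kernel of (32) for the pure third-order subsystem."""
    return H_b(f1 + f2 + f3) * H_a(f1) * H_a(f2) * H_a(f3)


def is_symmetric(kernel, freqs, tol=1e-12):
    """True if the kernel value is unchanged under every permutation of freqs."""
    reference = kernel(*freqs)
    return all(abs(kernel(*p) - reference) <= tol for p in permutations(freqs))


if __name__ == "__main__":
    tones = (1.0e9, 1.1e9, -0.9e9)
    for kernel in (h_z_unsymmetric, h3_interaction, h3_pure):
        print(kernel.__name__, is_symmetric(kernel, tones))
```

With three distinct test frequencies, only the unsymmetrized form of (23) changes value when its arguments are reordered, which is why the averaging step leading to (28) is needed.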
Noise in Mixed-Signal ICs

Abstract—Digital noise in mixed-signal circuits is characterized
using a scalable macromodel for substrate noise coupling. The
noise coupling obtained through simulations is verified with
measured data from a digital noise generator and noise-sensitive
analog circuits fabricated in a 0.35-μm heavily doped CMOS
process. The simulations and measurements also demonstrate
the effectiveness of including grounded guard rings and
separating bulk and supply pins in digital circuits to reduce substrate
coupling.


Index Terms—Coupling noise, integrated circuit noise, mixed- 
signal noise, substrate coupling, substrate noise, supply noise. 


I. INTRODUCTION 


THE integration of digital, analog, and RF circuitry to create
systems-on-chips (SoCs) has become a reality in present-day
integrated circuits (ICs). Creating SoCs has major advantages,
including reduced size, reduced cost, and lower power
dissipation. However, this high level of complexity and integration
causes noise coupling from the digital circuitry to the sensitive
RF and analog circuitry. If the noise coupling is not addressed,
it can result in significant performance degradation. The noise
coupling occurs when the digital circuitry switches rapidly
between high and low voltage levels. Current spikes are created
that couple through the power supply and the shared silicon
substrate [1]. Several approaches to modeling the substrate and
simulating the digital noise have been developed [2]-[10]. Issues
related to the proper inclusion of the package parasitics,
backplane connections, and noise suppression techniques have not
previously been adequately addressed.

This paper describes some of these issues and establishes
guidelines for the simulation, measurement, and suppression
of digital noise in mixed-signal integrated circuits. Section II
presents background on the scalable macromodel used in this
work [8], [11], [12]. This model serves as the foundation for
validating the simulations with measurements. Section III
separates out the contributions of substrate noise coupling due to
supply noise and transistor switching. The package parasitics
are shown to play a key role in the total substrate noise coupling
in mixed-signal ICs. The digital noise generating circuitry and
the analog sensing circuitry used to verify the circuit-level noise
simulations are described in Section IV. Section V presents
measurement results from a test chip fabricated in a 0.35-μm heavily
doped CMOS process and packaged in a 121-pin grid-array.
The measurements validate that the simulation approach is very
accurate. Based on the simulations and measurements, the
effectiveness of various techniques for reducing the digital noise
coupled into analog circuits is determined. Finally, Section VI
concludes the paper.

Fig. 2. Typical cross-section for a heavily doped substrate.


II. SUBSTRATE COUPLING MACROMODEL 


For efficient simulation of large SoCs, a simple model that ac- 
curately predicts substrate coupling must be used. Approaches 
including finite element methods [1], [15], [16], boundary el- 
ement methods [3], [5], and polynomial curve fitting methods 
[17], [18] provide accurate post-layout simulation but they are 
computationally intensive particularly for full chip simulation. 
Additionally, they do not allow for pre-layout simulation. The 
substrate coupling model used in this work is scalable with con- 
tact shapes, dimensions, and separations [8], [11]. The substrate 
is modeled by a two-port lumped resistor network and it is valid 
for frequencies below a few gigahertz [2], [3]. The lumped re- 
sistive model for p+ to p+ contacts and n+ to p+ contacts 
is shown in Fig. 1(a) and (b), respectively. The resistance, R12, 
models the coupling between the two contacts and R11 and R22 
model the coupling from the contacts to the backplane. The n+
contact to p-type substrate junction capacitance is modeled by
C_j.

The resistance values are determined by characterizing the
substrate either through device simulations, such as with the
Medici simulator [19], or through measurements of the substrate.
A typical heavily doped substrate profile is shown in Fig. 2 and
consists of three distinct layers: a heavily doped p+
channel-stop implant, a lightly doped epitaxial (epi) layer, and a heavily
doped p+ bulk [3]. The layer resistivities and thicknesses
determine the substrate coupling (and resistance values) between
contacts and to the backside.

Fig. 3. (a) Setup to measure the total noise at the substrate. (b) Setup used to measure supply-only noise at the substrate. (c) Setup used to measure switching-only noise at the substrate.

TABLE I
SAMPLE SUBSTRATE RESISTANCES IN A HEAVILY DOPED CMOS PROCESS

Separation (μm)   R11 (Ω)       R22 (Ω)       R12 (Ω)
10                390           390           962
50                305           [illegible]   [illegible]
100               [illegible]   [illegible]   [illegible]

As the separation between contacts increases in heavily doped
processes, the resistance between the contacts becomes very
large. At separations beyond about 100 μm, nearly all of the
current from the digital noise sources flows down into the
substrate through the resistance to the backplane and then back up
into the analog circuits when the backplane is floating. For this
reason, if the separation between digital and analog circuitry is
greater than 100 μm, increasing the separation beyond this point
provides only negligible improvement in the substrate coupling.
Table I illustrates typical resistor values for two identical
contacts at various separations in a heavily doped substrate. Notice
that if these resistor values are used in the model shown in Fig. 1
and the backplane is left floating, the resistors R11 and R22 are
indeed the dominant contact coupling path.
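
The claim about the dominant path can be checked with a simple current divider. This is a minimal sketch, assuming the lumped model of Fig. 1 with a floating backplane, so that the direct path R12 sits in parallel with the series path R11 + R22. The function names are hypothetical, and the numbers are the 10-μm row of Table I as read from the scan.

```python
def backplane_current_fraction(r11, r22, r12):
    """Fraction of contact-to-contact current flowing via the floating backplane.

    The direct path is R12; the backplane path is R11 in series with R22.
    The two paths are in parallel, so current divides inversely to resistance.
    """
    series_path = r11 + r22
    return r12 / (r12 + series_path)


def effective_coupling_resistance(r11, r22, r12):
    """Total resistance between the two contacts with the backplane floating."""
    series_path = r11 + r22
    return r12 * series_path / (r12 + series_path)


if __name__ == "__main__":
    r11, r22, r12 = 390.0, 390.0, 962.0  # 10-um separation, ohms
    print(backplane_current_fraction(r11, r22, r12))
    print(effective_coupling_resistance(r11, r22, r12))
```

Even at 10 μm, slightly more than half of the current takes the backplane path in this model, and the share grows as R12 rises with separation.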

Fig. 4. Simulated plots of supply noise (top) and switching noise (bottom) for a stepped buffer.

The scalable macromodel is based on Z-parameters from
which the resistances can be derived or vice versa [12]. The
model is expressed in terms of Z11, Z12, and Z22. Z12 is given
by

$$Z_{12} = \alpha\, e^{-\beta x}$$

where x is the distance between contacts and α and β are process
parameters. Z11 (Z22) is given by

$$Z_{11} = \frac{1}{K_1\,\mathrm{Area} + K_2\,\mathrm{Perimeter} + K_3}$$

where K1, K2, and K3 are process parameters. Using this
macromodel, substrate resistances can be obtained for an
arbitrary number of contacts.
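
The two macromodel expressions and the link between Z-parameters and the lumped resistors can be sketched as follows. This is an illustration under stated assumptions: the Z-to-R conversion uses the standard two-port pi-network relation with the backplane as the reference node, which may differ in detail from the procedure in [12], and all function names and parameter values are hypothetical.

```python
import math


def z12(separation_um, alpha, beta):
    """Cross-coupling impedance Z12 = alpha * exp(-beta * x)."""
    return alpha * math.exp(-beta * separation_um)


def z_self(area_um2, perimeter_um, k1, k2, k3):
    """Self impedance Z11 (or Z22) = 1 / (K1*Area + K2*Perimeter + K3)."""
    return 1.0 / (k1 * area_um2 + k2 * perimeter_um + k3)


def z_to_resistances(z11, z22, z12_value):
    """Convert two-port Z-parameters to the pi-network (R11, R22, R12).

    Assumes the backplane is the reference node, so that Y = Z^-1,
    R12 = -1/Y12, R11 = 1/(Y11 + Y12), and R22 = 1/(Y22 + Y12).
    """
    det = z11 * z22 - z12_value ** 2
    y11, y22, y12 = z22 / det, z11 / det, -z12_value / det
    return 1.0 / (y11 + y12), 1.0 / (y22 + y12), -1.0 / y12


def resistances_to_z(r11, r22, r12):
    """Inverse mapping, used here only as a round-trip check."""
    y11 = 1.0 / r11 + 1.0 / r12
    y22 = 1.0 / r22 + 1.0 / r12
    y12 = -1.0 / r12
    det = y11 * y22 - y12 ** 2
    return y22 / det, y11 / det, -y12 / det


if __name__ == "__main__":
    z = resistances_to_z(390.0, 390.0, 962.0)
    print(z)
    print(z_to_resistances(*z))
```

Because Z12 decays exponentially with separation, the extracted R12 grows rapidly with distance, while R11 and R22 are set by contact area and perimeter.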


III. DEPENDENCE OF POWER SUPPLY AND TRANSISTOR
SWITCHING NOISE ON PACKAGE PARASITICS


The power supply noise and transistor switching noise of a
seven-stage stepped buffer are simulated to illustrate the
contributions of each to the substrate coupling noise. Stepped buffers
are often a major source of substrate and supply noise in
mixed-signal ICs, as they provide buffering for clock signals and
serve as output buffers that drive large off-chip capacitances. The
stepped buffer consists of seven stages of inverters, with each
successive stage loaded by two inverters sized a power of e
larger than the previous; one is part of the seven-stage stepped
buffer and the other serves as a dummy load. The first inverter
transistors are sized (W/L)n = 5 μm/0.6 μm and (W/L)p =
10 μm/0.6 μm. The stepped buffer is designed and laid out in
a 0.35-μm heavily doped CMOS process, and the resistive
substrate network is extracted for this process and design. An
inductance of 5 nH is included in the power and ground lines to
model the effect of the bond wire and package inductance. The
supply and switching noise generated by the stepped buffer are
simulated using the approach in [7] and using the circuits shown


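TABLE II
RANGE OF PIN PARASITICS FOR DIFFERENT PACKAGES [20], [21]

[Table body largely illegible in the source. Recoverable column headings: capacitance, (nH), resistance (mΩ). Recoverable rows: LPCC, 32 pins, 1.1 to 1.47, 0.08 to 0.11; flip chip, 64 pins, 0.26 to 1.5, 0.18 to 0.38. Entries that cannot be assigned to a row or column: 24 to 200; 100 to 400; 200 to 100[?]; 200 to 450; 165 to 190; 7 to 54.]
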


in Fig. 3. Fig. 3(a) is the equivalent circuit used to simulate both
the supply and substrate noise contributions. The substrate
resistances are extracted from the macromodel or a boundary
element solver. Figs. 3(b) and (c) are the equivalent circuits used
to simulate the supply-only and substrate-only coupling. The substrate
voltage is measured at the backplane node so that there is no
dependence on the substrate contact size. The simulated results are
shown in Fig. 4. The peak-to-peak value of the supply noise is
three times larger than the peak-to-peak value of the switching
noise. The dominance of the supply noise over the switching
noise is due to the presence of the large supply inductance.
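
To give a feel for the size of the stepped buffer used in these simulations, the stage widths can be tabulated. This is a rough sketch, assuming that "a power of e larger" means each stage is e times wider than the one before it. The function name `stepped_buffer_widths` is hypothetical, and the starting widths are the first-inverter sizes quoted in this section.

```python
import math


def stepped_buffer_widths(first_width_um, stages=7, taper=math.e):
    """Return the transistor width of each stage, assuming a constant taper."""
    return [first_width_um * taper ** k for k in range(stages)]


nmos_widths = stepped_buffer_widths(5.0)   # first stage (W/L)n = 5 um / 0.6 um
pmos_widths = stepped_buffer_widths(10.0)  # first stage (W/L)p = 10 um / 0.6 um

if __name__ == "__main__":
    for stage, (wn, wp) in enumerate(zip(nmos_widths, pmos_widths), start=1):
        print(f"stage {stage}: Wn = {wn:8.1f} um, Wp = {wp:8.1f} um")
```

Under that assumption the final stage is roughly 400 times wider than the first, which is why the buffer draws large current spikes through the supply inductance.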
The type of package used in the design of mixed-signal ICs
and its particular parasitics can have a profound effect on the
substrate and supply noise coupling. Several packages and their
associated parasitic capacitances, resistances, and inductances
are listed in Table II. The stepped buffer is again simulated
using the average values for each package in Table II, and the
results are shown in Fig. 5. When the rms value of the substrate
noise is compared in all cases, it can be seen that the rms noise
varies from 3 mV for no package model to 24 mV for the BGA,
which is a factor of 8 larger. When the stepped buffer
is simulated with the flip-chip and LPCC package models, the
substrate noise is a factor of 3 lower than in the BGA case. On-chip
decoupling capacitance can also be used to significantly reduce
the supply noise.

Fig. 5. Graph showing the rms values of substrate noise generated by the stepped buffer for different package parasitics. The input frequency of the stepped buffer is 780 kHz. Source and bulk nodes of the transistors in the stepped buffer are connected to separate supplies.

Fig. 6. Effect of transistor sizing on the substrate noise coupling as a function of the supply inductance. The input frequency of the buffer is 10 MHz. (W/L)n,1x = 1 μm/0.6 μm, (W/L)p,1x = 2 μm/0.6 μm, (W/L)n,5x = 5 μm/0.6 μm, (W/L)p,5x = 10 μm/0.6 μm. [Plot title: Cross-over inductance for different sizes of stepped buffer.]

The supply inductance above which the supply noise dominates over
the transistor switching noise in a stepped buffer changes as the
size of the stepped buffer changes. A second seven-stage stepped buffer
is designed that is one-fifth the size of the previous buffer. Both
are simulated as a function of the supply inductance with an
input frequency of 10 MHz, and the results are plotted in Fig. 6.
For the larger stepped buffer, the transistor switching noise
dominates up to a supply inductance of 0.15 nH compared to 0.07 nH
for the smaller stepped buffer. However, it is also important to
note that the substrate coupling noise is nearly an order of
magnitude higher for the large stepped buffer. This indicates that for
the packages commonly used, supply noise will be the dominant
contributor to the noise.

Fig. 7. Measurement setup used for directly probing the substrate. Noise was measured via a p+ substrate tap connected directly to a probe pad.


IV. MEASUREMENT OF SUBSTRATE NOISE 


Three different noise-sensing methods were used to charac- 
terize the substrate coupling. The first method uses p+ substrate 
taps connected to pads that can be directly probed as shown in 
Fig. 7. This method is the simplest because it does not require 
additional on-chip circuitry for the measurement; however, it is 
generally not as accurate since the probe impedance may load 
the substrate. 

In the second measurement approach, a wide-band
differential-output amplifier based on the design in [7] is used to buffer
the substrate from the probe impedance. The amplifier, shown in
Fig. 8, has one input connected to the substrate via a large MOS
capacitor and the other input connected to a separate quiet
bias voltage. The input MOS capacitor is quite large, so at the
frequencies of interest it acts as a short circuit. The amplifier has
been designed so that a 50-Ω impedance high-frequency probe
can be connected at its output without changing the overall
performance. Additionally, the amplifier is designed for a 700-MHz
bandwidth, making it possible to perform continuous-time
measurements of the substrate at high frequencies. The probes used
to measure the output of the buffer amplifier were high-bandwidth
ground-signal-ground probes. Although the gain of the
buffer amplifier is relatively low, approximately 3 dB, this does
not limit the overall measurement as long as the amplitude of the
digital noise is within the range of the measurement accuracy.

Fig. 8. [...] input is sensitive to substrate noise while the other is connected to a quiet ground.

Fig. 9. Noise coupling measurement setup for the folded-cascode amplifier connected in a unity-gain configuration.

The noise-sensing buffer amplifier layout was arranged to 
maximize the matching between the input transistors and load 
resistors so that the input-referred offset is minimized and the 
maximum amount of substrate noise is sensed. This is achieved 
by interdigitating the input transistors and load resistors and 
incorporating dummy transistors and capacitors into the arrays. 
Additionally, the common-mode rejection ratio (CMRR) is 
maximized in the design. Ground-signal-ground pads were
placed above and below the opamp itself. This enables the 
routing traces from the circuit to the pads to be as short as 
possible, but still spaced far enough apart to meet the probing 
requirements. 

The third and final substrate noise sensing method involved
the use of an analog building block, which in this case was a
folded-cascode amplifier connected in a unity-gain configuration [14].
The purpose of the operational amplifier is to demonstrate the
application of the model and simulation approach for
evaluating simple mixed-signal circuits. A block diagram illustrating
the setup is shown in Fig. 9. By clocking the digital circuitry,
noise is injected into the substrate and the power supply lines.
This noise couples into the bulks of the input transistors and the
supply lines if they are shared. A seven-stage stepped buffer,
described in the previous section, is the noise-injecting source for
all the measurement cases.

Fig. 10. Diagram depicting one stage of the digital stepped buffer showing the parasitics and the substrate network. Separate source and bulk power supplies were used.

To accurately predict the digital noise coupled into the sub- 
strate, it is imperative that parasitic capacitances, inductances, 
and resistances associated with the package and the connection 
to the backplane be included in the overall simulation. Fig. 10 
shows the inclusion of critical parasitic elements as well as the 
substrate resistances for one inverter stage of a digital stepped 
buffer. These parasitic elements match those used for measure- 
ment of the test chip as described later. The package parasitics 
are obtained from the package model for a 121-pin grid-array 
package, whereas the substrate resistances were extracted using 
the scalable macromodel and are indicated by Ra-Rf. On-chip
interconnect resistances, R1-R4, are also modeled and included
in the simulations. An off-chip decoupling capacitor, Cd (and its
parasitics ESR and ESL), is used in the actual measurements to 
reduce the supply bounce. 


V. EXPERIMENTAL RESULTS 


A test chip was fabricated in a 0.35-μm heavily doped
CMOS quad-metal, double-poly process. It consisted of a
stepped buffer, a folded-cascode amplifier connected in unity
gain, and two substrate noise-sensing amplifiers, as shown in
Fig. 11. A single 3-V supply was used for all measurements.
The stepped buffer was placed approximately 100 μm away
from the folded-cascode amplifier and 400 μm and 800 μm
away from the two noise-sensing amplifiers. At distances above
approximately 100 μm, the cross-coupling resistance between
circuit elements is so large (in the MΩ range) that almost all the
noise is transmitted down into the substrate and then back up
into the circuit elements when the backplane is floating. For this
reason, all the measurements from both of the noise-sensing
amplifiers were identical.

Fig. 11. Microphotograph of the test chip containing the folded-cascode amplifier, noise-sensing amplifiers, and stepped buffer.

Fig. 12. Simulation and measurement of the substrate noise picked up by the noise-sensing amplifier with the stepped buffer running at a frequency of 780 kHz. [Plot axes: voltage (mV) versus time (μs).]

The transient behavior was measured using all three measure- 
ment techniques previously described. The noise-sensing ampli- 
fier provided a means by which continuous time measurements 
of the substrate could be made without loading the measurement 
with the probe impedance. 

Shown in Fig. 12 are the simulations and measurements
at the output of the noise-sensing amplifier when the stepped
buffer is clocked with a 3.3-V 780-kHz input waveform. The
relative voltage peaks and the amount of ringing from both
the simulations and the measurements are in good agreement.
Fig. 13 shows simulations and measurements made using
the p+ substrate tap as the means of measuring noise from
the stepped buffer. In contrast to the results from the buffer
amplifier, the substrate tap measurement amplitude is smaller
and has less ringing due to the loading of the probe. In Fig. 14,
the measurement and simulation of the folded-cascode amplifier
output in the unity-gain configuration are shown when the
stepped buffer is clocked. Once again, the general shape and
peak-to-peak voltage amplitude match very closely.

Fig. 13. Simulation and measurement of the substrate noise sensed at the substrate tap with the stepped buffer running at a frequency of 780 kHz.

Fig. 14. Simulation and measurement of the substrate noise sensed by the folded-cascode amplifier connected in a unity-gain configuration. The stepped buffer was operating at 1 MHz.


A. Separating Supply and Bulk Connections in Digital Circuits 


When supply noise is dominant in digital circuits, it may be 
possible to reduce the noise by separating the transistor’s bulk 
and source power supplies. Under normal circumstances, a tran- 
sistor in a digital circuit will have its bulk and source nodes tied 
together on chip and taken to a single pin. By using separate pins 
for the bulks and sources, voltage bounce on the power supply 
lines is not fed directly into the transistor bulks. This may help 
reduce the amount of noise which is injected into the substrate. 
Figs. 15(a) and (b) illustrate the two different scenarios, where 
the bulks and sources are connected to a single pin, Fig. 15(a), 
and to separate pins, Fig. 15(b). 

Fig. 16 shows the simulation and measurement results for the
noise picked up by the substrate tap when the stepped buffer’s
bulks and sources are connected as shown in Fig. 15(a). By tying
the sources and bulks together on chip, the peak-to-peak noise
picked up approximately doubles, as shown in Fig. 16(b). Our
results are different from those in [22] due to the smaller amount
of digital circuitry in the test chip. For chips dominated by
digital circuitry, separating the transistor sources and bulks may
[...]

Fig. 16. (a) [...] by the substrate tap when the stepped buffer’s bulks and sources are connected to separate pins. (b) Simulation of the noise at the substrate tap when the bulks and sources are tied together and routed to a single pin.

[...] the routing resistance is only about 50 Ω, this capacitor will not
provide an on-chip short between the source and the bulk even
for frequencies up to several hundred megahertz.

Interestingly, the negative noise peaks remain relatively
unaffected by the change, while the positive spikes triple in
amplitude for the case where the source and bulk nodes are tied
together. The reason for this behavior is the large off-chip
capacitance that is being driven by the stepped buffer. Because the
noise injected into the substrate is determined by the discharge of
this large capacitance, it does not change when the bulk and
supplies are separated. For smaller values of off-chip
capacitance (or no capacitance), both peaks are reduced by separating
the bulks and the power supplies.

Fig. 18. Simulation (top) and measurement (bottom) of the folded-cascode amplifier with the die-perimeter ring left floating.


B. Grounding the Die Perimeter Ring 


When switching noise is dominant in heavily doped sub- 
strates, it can be reduced by grounding the substrate backside. 
Backside metallization is one method of grounding the back- 
plane. However, this is not standard and adds extra cost to 
production. On the other hand, die-perimeter rings are standard 
on many chip designs since they are often used in electrostatic 
discharge (ESD) protection schemes to reduce the ground 
resistance between I/O pads. On this test chip, a grounded 
die-perimeter ring has been used to ground the backplane 
since the measured resistance between the backplane and the 
die-perimeter ring is only 1.6 Ω. A schematic of the setup is
shown in Fig. 17.

Figs. 18 and 19 show the simulations and measurements of
the folded-cascode amplifier with and without the die-perimeter
ring grounded [13]. The die-perimeter ring resistance to the
backplane was measured to be 1.6 Ω, and this was used in the
simulation. In both the simulations and measurements without
die-perimeter ring grounding, the peak-to-peak noise voltage
observed is around 55 mV. Contrasting this to the noise voltage
of the grounded case, it can be seen that the peak-to-peak value
is approximately halved and is now at 26 mV.

Fig. 19. Simulation (top) and measurement (bottom) of the folded-cascode amplifier with the die-perimeter ring grounded.

Fig. 20. Comparison of total substrate noise generated by the stepped buffer for different cases: bulk and source nodes tied together and separate, backplane grounded and not grounded. The input frequency of the buffer is 10 MHz.

Simulations shown in Fig. 20 summarize the effects of
grounding the backplane and separating the bulk and supply
connections for the PMOS devices. Separating the bulk and
source nodes reduces the substrate noise by approximately
7.5 dB at high inductance values, e.g., 1 nH. When the bulk and
source nodes are tied together and the backplane is grounded,
the substrate noise reduction on the backplane is 8.5 dB for low
inductance values. There is approximately a 15-dB reduction
in substrate noise when the source and bulk nodes of the
transistors are connected separately and the backplane is grounded,
compared to connecting the source and bulk nodes together and
floating the backplane.
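
The peak-to-peak voltages and decibel figures quoted in this subsection can be related with a short calculation. This sketch assumes the reductions are voltage ratios expressed as 20 log10, which the text does not state explicitly. The function names are hypothetical.

```python
import math


def reduction_db(before, after):
    """Noise reduction in dB for two voltage amplitudes (20*log10 convention)."""
    return 20.0 * math.log10(before / after)


def db_to_ratio(db):
    """Voltage ratio corresponding to a reduction quoted in dB."""
    return 10.0 ** (db / 20.0)


# Floating versus grounded die-perimeter ring, mV peak to peak.
ring_reduction = reduction_db(55.0, 26.0)

# Simulated reductions quoted for Fig. 20, in dB.
ratios = {db: db_to_ratio(db) for db in (7.5, 8.5, 15.0)}

if __name__ == "__main__":
    print(f"ring grounding: {ring_reduction:.2f} dB")
    for db, ratio in ratios.items():
        print(f"{db:4.1f} dB corresponds to a factor of {ratio:.2f}")
```

Under that convention, going from 55 mV to 26 mV is a reduction of about 6.5 dB, and the 15-dB combined figure corresponds to a voltage ratio of roughly 5.6.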

Grounding the die-perimeter ring is an effective way to reduce 
digital noise. However, the die-perimeter ring must have a low- 
impedance path to ground to be effective. This can be achieved 
by connecting the die-perimeter ring to a bond-pad and then 
down-bonding. If a high impedance path (i.e., a long wire with 
significant inductance) is used to connect the die-perimeter ring 
to ground, the results may show no change or even an increase 
in noise coupling. 


VI. CONCLUSION 


An approach for simulating digital noise coupling has been
discussed and verified using measurements from a test chip
fabricated in a 0.35-μm heavily doped CMOS process.
Measurements were shown for noise coupled from a stepped buffer to
an analog noise-sensing amplifier, a folded-cascode amplifier, and
a substrate tap. These measurements match closely with the
simulated results. Based on these measurements and simulations, it
can be concluded that the macromodel gives a good
approximation of the noise that will be coupled to a given analog circuit.
Additionally, noise suppression techniques have been
discussed. Measurements and simulations for our test chip show
that more than 6-dB noise reduction can be achieved by using
separate digital bulk and source power supply pins, and more
than 6-dB reduction can be obtained by using a die-perimeter
ring connected to ground.

The results of this work show that the choice of package for
mixed-signal chips greatly affects the amount of total substrate
noise. Below package inductance values of 100 pH, noise
coupling from MOSFETs in the cases presented here is dominant.
Further reduction in package inductance beyond this point will
only slightly reduce the total substrate noise generated. Most
flip-chip packages satisfy this criterion and are therefore an excellent
choice for ensuring power supply noise coupling is not a factor.
High-Performance RF Mixer and Operational
Amplifier BiCMOS Circuits Using Parasitic Vertical
Bipolar Transistor in CMOS Technology


Iku Nam, Student Member, IEEE, and Kwyro Lee, Senior Member, IEEE 


Abstract—The electrical characteristics of the parasitic vertical
NPN (V-NPN) BJT available in deep n-well 0.18-μm CMOS
technology are presented. It has a current gain of about 20, a
collector-emitter breakdown voltage of 7 V, a collector-base
breakdown voltage of 20 V, an Early voltage of 40 V, a cutoff
frequency of about 2 GHz, and a maximum oscillation frequency of
about 4 GHz at room temperature. The corner frequency of 1/f noise
is lower than 4 kHz at 0.5 mA of collector current. The
double-balanced RF mixer using the V-NPN is almost free of 1/f noise,
has an order of magnitude smaller dc offset than the CMOS
circuit, and has a flat gain of 12 dB almost up to the cutoff frequency.
The V-NPN operational amplifier for baseband analog circuits
has higher voltage gain and better input noise and input offset
performance than the CMOS ones at the same current. These
circuits using the V-NPN make a high-performance
direct conversion receiver possible in CMOS technology.

Index Terms—BiCMOS, deep n-well CMOS, direct conversion 
receiver, offset, operational amplifier, parasitic vertical bipolar 
transistor, RF mixer, 1/f noise. 


I. INTRODUCTION 


COMPARED with the MOSFET, BJT (bipolar junction
transistor) devices have many desirable characteristics
for analog applications including RF, namely, much smaller 1/f
noise, much better device-to-device matching, larger
transconductance, easier biasing, easier impedance matching, and so
forth. For this reason, RF and analog circuit designers usually
prefer the BJT over the MOSFET, and most state-of-the-art
radio chips have been fabricated using BiCMOS processes
where the high-performance vertical Si/Ge BJT is used for RF
circuits and CMOS for logic [1]-[3]. However, the BiCMOS
process has several drawbacks: it is expensive, the
process development period is long, the foundry service is
very limited, and the performance of BiCMOS digital circuits is
inferior to that of CMOS ones. As a result, this process may be
unsuitable for the implementation of a low-cost single-chip radio.

On the other hand, continuous advances in CMOS technology 
provide both good RF circuits and digital VLSI at very low 
cost [4], [5]. Deep-submicron CMOS processes are regarded 
as well suited to integrating digital modem blocks. In modern 
wireless communication receivers, the highest degree of 
integration is achieved with the direct conversion receiver (DCR). 
Therefore, realization of the DCR in CMOS technology has been 
extensively studied as a possible solution for a low-cost 
single-chip radio [6], [7]. However, the CMOS DCR inherently suffers 
from serious problems of 1/f noise, dc offset, I/Q mismatch, LO 
(local oscillator) leakage, even-order distortion, and so on [8]. 
Although some of these can be alleviated by novel circuit 
techniques, careful layout, and compensation by digital signal 
processing, 1/f noise and dc offset remain critical 
issues in CMOS analog circuits because the MOSFET itself 
has very large 1/f noise and mismatch. These are especially 
problematic for DCR and baseband analog (BBA) circuits, 
where they seriously degrade the overall sensitivity of the CMOS 
receiver and hinder its commercialization. 

Therefore, there have been many attempts to use the parasitic 
lateral BJT available in CMOS technology [9]-[14]. Because its 
base width is basically determined by the MOSFET gate length, 
very high current gain and unity-current-gain cutoff frequency are 
expected from scaled-down CMOS technology. However, the 
uniformity, reproducibility, device matching, and driving 
capability of these lateral devices are too questionable for them to be 
useful in practice. In addition, there has been some effort 
to make use of the parasitic substrate vertical BJT available in 
double-well CMOS processes [15]. However, the use of this 
transistor is very limited since its collector is tied to the 
substrate. Moreover, its RF performance is not satisfactory 
because of the thick well depth. 

In this paper, we present the RF characteristics of the parasitic 
vertical NPN (V-NPN) BJT available in a deep n-well CMOS 
process [16] and the results of using the V-NPN in an RF mixer 
with low 1/f noise and dc offset, as well as in a simple one-stage 
operational amplifier, in order to appraise the feasibility of 
high-frequency circuits and BBA circuits using V-NPN. Deep n-well 
CMOS technology and the parasitic V-NPN are briefly described 
in Section II. The RF characteristics of V-NPN are presented 
in Section III. The RF mixer and simple one-stage operational 
amplifier using V-NPN are described in Sections IV and V, 
respectively. In Section VI, we propose two methods to increase 
the operating frequency of V-NPN for DCR, followed by the 
conclusion in Section VII. 


II. PARASITIC V-NPN IN DEEP N-WELL CMOS 


Nowadays, most state-of-the-art CMOS foundries 
provide triple-well (deep n-well) technology [17]. The 
cross-sectional view showing the well structure and various devices 
available from the deep n-well CMOS technology is presented 
in Fig. 1(a). The prime motivation for deep n-well CMOS 
is that a different substrate bias can be applied to NMOS devices 
residing in separate p-wells, so that threshold voltages can be 
adjusted electrically, which is one of the most efficient ways 
to adaptively adjust power consumption. Moreover, this 
triple-well CMOS technology, specifically the deep n-well, can 
provide excellent isolation against substrate coupling noise 
among and between digital baseband logic circuits and RF 
and BBA circuits, which is especially important for integrating 
RF and baseband mixed-mode circuits in a single chip. The 
deep n-well can completely isolate the p-well where the NMOS 
resides from the substrate coupling noise generated in other 
circuit blocks. 

It should be noted that a high-performance V-NPN is obtained 
for free from this CMOS technology, as shown in Fig. 1(a). 
It is composed of the n+ source-drain diffusion as the emitter, 
the p-well diffusion and p+ contact as the base, and the deep 
n-well, n-well diffusion, and n+ contact as the collector. The deep 
n-well V-NPN provides not only lower collector resistance 
but also a thinner p-base width, both of which can lead to high 
BJT performance. Note that the V-NPN differs from the 
previous parasitic substrate vertical BJT in that each collector is 
completely isolated. Since the V-NPN has much better uniformity, 
reproducibility, device matching, driving capability, and more 
ideal BJT characteristics than the lateral one, we expect that the 
availability of this device can have a great impact on 
mixed-mode circuits such as DCR. 

Fig. 1. (a) Cross-sectional view of the deep n-well CMOS technology. (b) Layout for a V-NPN with four emitter fingers. 


III. ELECTRICAL CHARACTERISTICS OF V-NPN 


V-NPNs with various numbers of emitter fingers (1 to 5) 
were laid out and fabricated in a deep n-well 0.18-µm 1-poly 
6-metal CMOS foundry process. The area of each emitter 
finger is 0.54 × 6.04 µm². Fig. 1(b) shows an example layout 
for a V-NPN with four emitter fingers. The dc characteristics 
of this device were measured with an HP 4156 semiconductor 
parameter analyzer. Fig. 2(a) shows the collector current (I_C) 
versus collector voltage (V_CE) curves measured with the 
base current varied from 10 µA to 40 µA. An Early voltage, V_A, 
of 40 V is obtained by extrapolating the active region of the curves in 
Fig. 2(a), which is much larger than that of a MOSFET. A dc current gain 
of 18, a BV_CBO (collector-base breakdown voltage) of about 
20 V, and a BV_CEO (collector-emitter breakdown voltage) of 
about 7 V are obtained. The Gummel plot is shown in Fig. 2(b). 
The curve of Fig. 2(c) shows that the current gain is almost 
constant over a wide range of collector current. At very low 
collector current, it depends on the collector current, indicating 
some nonideal base current characteristics. The maximum 
current gain of 18 is obtained at 22 µA of I_C. Note, however, 
that this dependence is much weaker than that in the lateral NPN 
[13], showing characteristics much closer to those of an ideal BJT. 

To examine the high-frequency characteristics of the V-NPN, 
S-parameters were measured with an HP 8510C network analyzer 
in the frequency range from 400 MHz to 6 GHz. The measured 
S-parameters were corrected for pad and interconnection 
parasitic contributions by means of open and short de-embedding 
patterns. The de-embedded spectra of the current gain |h21|² 
and the MAG (maximum available gain) for the V-NPN at 1.3 mA 
of collector bias current are shown in Fig. 3(a). The 
unity-current-gain cutoff frequency f_T is 1.9 GHz and the maximum 
oscillation frequency f_max is 3.76 GHz. Fig. 3(b) plots f_T and f_max 
versus I_C, showing that peak f_T and f_max are obtained near 1 mA 
of I_C for this particular device. The unity-current-gain cutoff 
frequency is approximately given by 


$$
f_T \approx \frac{1}{2\pi}\left[\tau_F + \frac{kT\,(C_{je} + C_{jc})}{q I_C}\right]^{-1} \tag{1}
$$

where τ_F is the forward charge-control time constant, C_je is 
the emitter-base junction capacitance, C_jc is the collector-base 
junction capacitance, k is Boltzmann's constant, T is absolute 
temperature, and q is the electronic charge [18]. Fig. 3(c) shows 
the 1/f_T versus 1/I_C characteristics. From the y-intercept of this 
plot, we obtain τ_F of 85 ps. Assume that τ_F is 
dominated by the base transit time, τ_B, expressed as 
follows: 

$$
\tau_B \approx \frac{W_B^2}{2 D_n} \tag{2}
$$

where D_n is the diffusion constant for electrons, which 
is about 5.17 cm²·s⁻¹ at the given boron impurity concentration, and 
W_B is the base width [18]. The base width calculated from this 
is W_B = 0.3 µm [see Fig. 16(b)], which is very close to the 
process data, indicating that the f_T of this device is dominated by the base 
transit time in the vertical direction. Fig. 3(d) plots the peak f_T and 
f_max of V-NPNs with various numbers of emitter fingers. Regardless 
of the number of emitter fingers, the f_T and f_max of the V-NPN are 
about 2 GHz and 4 GHz, respectively. This also indicates that 
the high-frequency characteristics of the V-NPN depend not on 
layout-dependent parasitics but on the base width. 

Fig. 2. DC characteristics of V-NPN with four emitter fingers: (a) collector current (I_B from 10 µA to 40 µA in steps of 10 µA); (b) Gummel plot. 
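
The base-width estimate from (1) and (2) can be checked numerically. The following is a minimal Python sketch, assuming (as the text does) that τ_F equals the base transit time τ_B; the function names are illustrative and do not come from the paper. It also evaluates the transit-time limit of (1), that is, f_T when the kT/(qI_C) term is negligible.

```python
import math


def base_width_um(tau_b_s: float, d_n_cm2_s: float) -> float:
    """Solve tau_B = W_B^2 / (2 * D_n) for W_B and return it in micrometres."""
    w_b_cm = math.sqrt(2.0 * d_n_cm2_s * tau_b_s)
    return w_b_cm * 1e4  # 1 cm = 1e4 um


def transit_limited_ft_hz(tau_f_s: float) -> float:
    """f_T from (1) in the limit where the kT/(q*I_C) capacitance term vanishes."""
    return 1.0 / (2.0 * math.pi * tau_f_s)


TAU_F = 85e-12  # s, y-intercept value quoted in the text
D_N = 5.17      # cm^2/s, electron diffusion constant quoted in the text

w_b = base_width_um(TAU_F, D_N)
ft_limit = transit_limited_ft_hz(TAU_F)
print(f"W_B = {w_b:.3f} um")
print(f"transit-time-limited f_T = {ft_limit / 1e9:.2f} GHz")
```

With the quoted values this gives a base width of roughly 0.30 µm and a transit-time limit of about 1.9 GHz, in line with the measured peak f_T of about 2 GHz.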

Because the V-NPN is a parasitic device, there is a concern about 
its uniformity. Therefore, we measured parameters such as 
β, V_A, output resistance (r_o), and f_T on 30 samples of V-NPN 
with four emitter fingers fabricated on the same wafer under the 
same conditions as above. Fig. 4 plots the histograms of these 
parameters over the samples. As shown in Fig. 4, the V-NPN shows 
excellent within-wafer uniformity, with variation of less than 3.7% for all the 
parameters studied in this paper. 

The flicker noise of the V-NPN was measured 
with a low-noise current preamplifier and a spectrum 
analyzer. As shown in Fig. 5, the corner frequency of flicker noise 
for the V-NPN is as low as 4 kHz at 0.5 mA of collector current. 
In contrast, the corner frequency of a 20 µm/0.18 µm NMOS 
is about 3 MHz at the same current. As expected, the V-NPN 
has much better flicker noise performance, indicating the 
feasibility of fabricating mixer and BBA circuits that are almost free of 
1/f noise. 


IV. RF MIXER FOR DCR USING V-NPN 


The output noise voltage of the down-conversion mixer using 
MOSFETs for DCR can be calculated as V²_out = V²_out,NT + 
V²_out,NS + V²_out,R, as shown in Fig. 6, where V²_out,NT is the 
noise generated in the transconductor, N_T, V²_out,NS is that in 
the switch, N_S, and V²_out,R is that in the load resistor, R. Here, 
V²_out,NT can be expressed as 

$$
\overline{V_{out,NT}^2} = 2 \times \left(4kT\gamma\, g_{d0,NT}\, G_V^2 / g_{m,NT}^2\right)\Delta f \tag{3}
$$

where g_d0,NT is the drain conductance of N_T at V_DS = 0 V, γ 
represents the ratio of the value of thermal noise at any given 
drain bias to the value of thermal noise at V_DS = 0 V [19], 
G_V = 2g_m,NT R/π is the voltage gain of the mixer, g_m,NT 
is the transconductance of N_T, Δf is the bandwidth in hertz, 
and the factor 2 results from the two N_T's. The output noise 
voltage spectral densities due to the switching pair and load 
resistor, V²_out,NS and V²_out,R, can be expressed as 

$$
\overline{V_{out,NS}^2} = 4 \times \left[4kT\gamma\, g_{d0,NS} R^2 + K g_{m,NS}^2 R^2 / (C_{ox} W_{NS} L_{NS} f)\right]\Delta f \tag{4}
$$

and 

$$
\overline{V_{out,R}^2} = 2 \times (4kTR)\,\Delta f \tag{5}
$$

respectively. Here K is a process-dependent constant for 1/f 
noise (see Fig. 5), C_ox is the gate oxide capacitance per unit 
area, W_NS is the width of N_S, L_NS is the channel length of N_S, 
the factor 4 in (4) comes from the four N_S's, and the factor 2 in 
(5) comes from the two R's. 

Fig. 3. RF characteristics of V-NPN: (a) the current gain |h21|² and the maximum available gain (MAG); (b) cutoff frequency (f_T) and maximum oscillation 
frequency (f_max) versus collector current (I_C); (c) 1/f_T versus 1/I_C plot showing base transit time of 85 ps; (d) peak f_T and f_max of V-NPNs with various 
numbers of emitter fingers with unit finger area of 0.54 µm × 6.04 µm. All data are measured at V_CE = 1 V. 
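
To see how the three contributions in (3)-(5) compare, they can be evaluated per hertz of bandwidth. The sketch below is an illustration under stated assumptions: the flicker term follows the form K g_m² R²/(C_ox W L f) as written in (4), and every numerical device value is a hypothetical placeholder, not a value from the paper.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def v2_out_nt(gamma, gd0_nt, gm_nt, r_load, temp=300.0):
    """Eq. (3): transconductor noise at the output, in V^2/Hz."""
    g_v = 2.0 * gm_nt * r_load / math.pi  # mixer voltage gain G_V
    return 2.0 * (4.0 * K_B * temp * gamma * gd0_nt * g_v**2 / gm_nt**2)


def v2_out_ns(gamma, gd0_ns, gm_ns, r_load, k_flicker, c_ox, w, l, f, temp=300.0):
    """Eq. (4): switching-pair thermal plus 1/f noise at the output, in V^2/Hz."""
    thermal = 4.0 * K_B * temp * gamma * gd0_ns * r_load**2
    flicker = k_flicker * gm_ns**2 * r_load**2 / (c_ox * w * l * f)
    return 4.0 * (thermal + flicker)


def v2_out_r(r_load, temp=300.0):
    """Eq. (5): thermal noise of the two load resistors, in V^2/Hz."""
    return 2.0 * (4.0 * K_B * temp * r_load)


# Hypothetical placeholder values, chosen only to exercise the formulas.
params = dict(gamma=2.0 / 3.0, gd0_ns=5e-3, gm_ns=5e-3, r_load=1e3,
              k_flicker=1e-24, c_ox=8e-3, w=50e-6, l=0.18e-6)

for f in (1e3, 1e5, 1e7):
    total = (v2_out_nt(2.0 / 3.0, 5e-3, 5e-3, 1e3)
             + v2_out_ns(f=f, **params)
             + v2_out_r(1e3))
    print(f"f = {f:8.0e} Hz  total output noise = {math.sqrt(total):.3e} V/sqrt(Hz)")
```

Only the switching-pair term depends on frequency, which is why replacing the MOSFET switches with a low-1/f device removes the low-frequency rise.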

As shown in (4), the low-frequency noise is dominated by 1/f 
noise. Thus, we expect very small low-frequency noise in a 
mixer adopting V-NPN in the switching pair. To demonstrate 
this, we designed and fabricated a double-balanced RF mixer 
for DCR using the V-NPN introduced in Section III, as shown in 
Fig. 7. Note, however, that we still use NMOS (80 µm/0.18 µm) 
transconductors, because they provide higher linearity and gain 
with 1 mA of total mixer core current. The chip photograph is 
shown in Fig. 8. In order to minimize the parasitic capacitance 
C_CS between the collector and the substrate, the collectors of 
the V-NPN switching transistor pairs Q1 and Q3, and Q2 and Q4, 
were shared, respectively. The RF mixer was laid out as 
symmetrically as possible. 

The measured conversion gain versus RF frequency is shown 
in Fig. 9. For the measurement, the IF frequency is chosen as 1 MHz. 
When the RF frequency is over 2.4 GHz, the conversion gain 
decreases. It is interesting to note that this mixer's 3-dB cutoff 
frequency is about 2.4 GHz, which is higher than the maximum 
f_T of 2 GHz. We believe that this is due to the frequency-doubling 
effect of differential circuits [20]. This is an encouraging 
result and is thought to be characteristic of the double-balanced 
mixer. Fig. 10 plots the IP3 measurement results when two 
tones at 902.5 MHz and 903.5 MHz are mixed with an LO frequency 
of 900 MHz, and when two tones at 2102.5 MHz and 2103.5 MHz are 
mixed with an LO frequency of 2100 MHz. The IIP3 is 
measured as −3.2 dBm and −5 dBm, respectively. 
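
For the two-tone tests described above, the baseband positions of the fundamentals and third-order intermodulation (IM3) products follow directly from the tone and LO frequencies. The sketch below is an illustration only: it uses the standard graphical IIP3 estimate (input power plus half the fundamental-to-IM3 spacing in dB), which is a textbook relation rather than a procedure stated in the paper, and the power levels in the example call are hypothetical.

```python
def baseband_tones_mhz(f1, f2, f_lo):
    """Return ((fundamentals), (IM3 products)) after down-conversion, in MHz."""
    fundamentals = (abs(f1 - f_lo), abs(f2 - f_lo))
    im3 = (abs(2 * f1 - f2 - f_lo), abs(2 * f2 - f1 - f_lo))
    return fundamentals, im3


def iip3_dbm(p_in_dbm, delta_db):
    """Standard two-tone estimate: IIP3 = P_in + (P_fund - P_IM3) / 2."""
    return p_in_dbm + delta_db / 2.0


for f1, f2, f_lo in ((902.5, 903.5, 900.0), (2102.5, 2103.5, 2100.0)):
    fund, im3 = baseband_tones_mhz(f1, f2, f_lo)
    print(f"LO {f_lo} MHz: fundamentals at {fund} MHz, IM3 at {im3} MHz")

# Hypothetical reading: -30 dBm per tone with IM3 53.6 dB below the fundamental.
print(f"example IIP3 = {iip3_dbm(-30.0, 53.6):.1f} dBm")
```

In both test cases the fundamentals land at 2.5 MHz and 3.5 MHz and the IM3 products at 1.5 MHz and 4.5 MHz, so the same baseband measurement setup serves both LO frequencies.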

Fig. 11 presents the measured noise figure. As expected, the 
mixer has excellent low frequency noise performance, showing 
only thermal noise and almost 1/f-noise-free characteristic. 
Therefore, the RF mixer using V-NPN switching transistors 
can be used even in very narrowband DCR such as for GSM. 
Fig. 5. Measured output noise spectra of V-NPN with four emitter fingers and 
NMOS of 20 µm/0.18 µm at 0.5 mA. The solid lines are 1/f noise models 
fitted with K_N = 3 × 10⁻¹⁰ and K_V = 4 × 10⁻³. 


The output dc offset voltage of the mixer using the V-NPN 
switching pair is shown in Fig. 12, measured as a function of 
LO input power; its zero-power limit is 0.6 mV. By contrast, 
the typical value for the mixer using NMOS 
switching transistors (aspect ratio 50 µm/0.18 µm) is measured 
as 5-10 mV. This order-of-magnitude improvement is due to the 
much better device-to-device matching of the V-NPN 
compared with the NMOS device. Fig. 12 shows that the dc offset 
voltage increases as the LO input power and the LO frequency 
increase, as expected from LO self-mixing. 

Table I compares the performance of the V-NPN mixer 
against that of other published CMOS mixers. Clearly, the 
V-NPN mixer achieves excellent noise figure and IIP2 performance 
owing to V-NPN characteristics such as low 
1/f noise and good device-to-device matching. The parasitic 
V-NPN in a deep n-well CMOS process can provide good enough 
mixer performance, opening a new horizon for low-cost CMOS 
DCR. 


V. OPERATIONAL AMPLIFIER USING V-NPN 


In addition to the RF front-end, BBA circuits are also an 
important part of wireless communication circuits. An 
operational amplifier is an essential part of BBA circuits such as 
active RC filters, programmable gain amplifiers, etc. CMOS 
operational amplifiers (op amps) suffer from many problems such 
as large 1/f noise, large input offset voltage, and so forth. At 
low source impedance, the equivalent input noise voltage of the 
one-stage CMOS op amp in Fig. 13(a) is expressed as [21] 

$$
\overline{V_n^2} = 2\left\{\frac{4kT(2/3)}{g_{m1}} + \frac{K_N}{C_{ox} W_1 L_1 f}
+ \frac{g_{m3}^2}{g_{m1}^2}\left[\frac{4kT(2/3)}{g_{m3}} + \frac{K_P}{C_{ox} W_3 L_3 f}\right]\right\}\Delta f. \tag{6}
$$

The equivalent input noise voltage is mainly dominated by that 
of the differential NMOS input pair. As can be seen from (6), 
increasing the gate area of the input transistors can reduce the 1/f 
noise. However, the unavoidable penalties are greatly increased 
area and large input capacitances, both of which inevitably 
increase die size as well as power consumption [14]. 

The alternative to a large gate area for the NMOS input 
transistors is to adopt BJTs in the input stage. To assess the feasibility of 
using V-NPN in BBA circuits, a simple one-stage differential op 
amp has been designed, as shown in Fig. 13(b). The equivalent 
input noise voltage of the one-stage V-NPN op amp in Fig. 13(b) is 
expressed as 

$$
\overline{V_n^2} = 2\left\{4kT r_b + \frac{4kT}{2 g_{mv}}
+ \frac{g_{m3}^2}{g_{mv}^2}\left[\frac{4kT(2/3)}{g_{m3}} + \frac{K_P}{C_{ox} W_3 L_3 f}\right]\right\}\Delta f \tag{7}
$$

where r_b is the base resistance of Q1. Because the V-NPN has 
much larger transconductance g_mv (= qI_C/kT) and smaller 1/f 
noise than the MOSFET, we expect much better noise performance 
from (7). Moreover, because the Early voltage V_A and the 
output resistance r_ov (= V_A/I_C) are larger, much larger voltage 
gain |A_V| ≈ g_mv(r_ov // r_o3) can be obtained at the same bias 
current. The only significant disadvantage of the V-NPN op amp 
compared to a CMOS one is the input bias current. The 
equivalent input noise current of a CMOS op amp is usually negligible 
because of its very small input bias currents. However, the V-NPN op 
amp has a significant input noise current I_n generated by the 
base currents of the V-NPN input transistors. 

Fig. 6. The output noise voltage spectral density of double-balanced Gilbert mixer using MOSFET. 

Fig. 8. Chip photograph of RF mixer using V-NPN switches. 

Fig. 9. Measured conversion gain versus RF frequency. 

Fig. 10. IP3 plot measured at LO input power of −8 dBm. The IIP3 is 
−3.2 dBm and −5 dBm, respectively. 
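
The input-referred noise expressions (6) and (7) can be compared numerically. The following sketch is illustrative only: the functions transcribe (6) and (7) per hertz of bandwidth, and all device values in the example (transconductances, base resistance, flicker constants, gate areas) are hypothetical placeholders rather than values from the paper.

```python
K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electronic charge, C


def gm_bjt(i_c, temp=300.0):
    """Bipolar transconductance g_mv = q * I_C / (k * T)."""
    return Q_E * i_c / (K_B * temp)


def load_noise(gm_in, gm3, k_p, c_ox, w3l3, f, temp):
    """PMOS load term shared by (6) and (7), referred to the input."""
    return (gm3 / gm_in) ** 2 * (4 * K_B * temp * (2 / 3) / gm3 + k_p / (c_ox * w3l3 * f))


def vn2_cmos(f, gm1, gm3, k_n, k_p, c_ox, w1l1, w3l3, temp=300.0):
    """Eq. (6): input noise of the one-stage CMOS op amp, in V^2/Hz."""
    own = 4 * K_B * temp * (2 / 3) / gm1 + k_n / (c_ox * w1l1 * f)
    return 2 * (own + load_noise(gm1, gm3, k_p, c_ox, w3l3, f, temp))


def vn2_vnpn(f, r_b, gmv, gm3, k_p, c_ox, w3l3, temp=300.0):
    """Eq. (7): input noise of the one-stage V-NPN op amp, in V^2/Hz."""
    own = 4 * K_B * temp * r_b + 4 * K_B * temp / (2 * gmv)
    return 2 * (own + load_noise(gmv, gm3, k_p, c_ox, w3l3, f, temp))


# Hypothetical placeholder values for a 60-uA-per-branch bias.
I_BRANCH = 60e-6
common = dict(gm3=0.3e-3, k_p=1e-25, c_ox=8e-3, w3l3=100e-12)
for f in (1e2, 1e4, 1e6):
    cmos = vn2_cmos(f, gm1=0.8e-3, k_n=1e-24, w1l1=20e-12, **common)
    vnpn = vn2_vnpn(f, r_b=200.0, gmv=gm_bjt(I_BRANCH), **common)
    print(f"f = {f:7.0e} Hz  CMOS {cmos ** 0.5:.2e}  V-NPN {vnpn ** 0.5:.2e}  V/sqrt(Hz)")
```

In (7) the only 1/f term comes from the PMOS load, scaled down by (g_m3/g_mv)², which is why a large bipolar transconductance lowers both the thermal floor and the corner frequency.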
TABLE I 
MEASURED PERFORMANCE SUMMARIES OF RF MIXER USING V-NPN AND COMPARISON TO OTHER CMOS MIXERS ALREADY PUBLISHED 

                     This work        This work        Other CMOS       Other CMOS
                     (0.9 GHz)        (2.1 GHz)        mixer            mixer
Conversion gain      12 dB            12 dB            15 dB            18 dB
IIP2                 > 40 dBm         > 30 dBm         44 dBm           30 dBm
IIP3                 −3.2 dBm         −5 dBm           −8.2 dBm         −4 dBm
NF (DSB)             7.6 dB           8.5 dB           17.8 dB          18 dB (SSB)
Power consumption    0.9 mA @ 1.8 V   0.9 mA @ 1.8 V   0.73 mA @ 3 V    6 mA @ 2.7 V
Technology           0.18-µm CMOS     0.18-µm CMOS     0.35-µm CMOS     0.35-µm CMOS
Fig. 11. Noise figure measured at 0.9 GHz and 2.1 GHz. 


On the other hand, the input offset voltage of the CMOS op 
amp, V_OS,n, and that of the V-NPN op amp, V_OS,v, can be 
approximated respectively as [22] 

$$
V_{OS,n} \approx V_{TH1} - V_{TH2} + (V_{TH3} - V_{TH4})
\sqrt{\frac{\mu_p (W/L)_p (1 + |\lambda_p V_{DSp}|)}{\mu_n (W/L)_n (1 + \lambda_n V_{DSn})}}
+ \frac{1}{2}\sqrt{\frac{I_{TAIL}}{\mu_n C_{ox} (W/L)_n (1 + \lambda_n V_{DSn})}}
\times \left(\frac{\Delta(W/L)_p}{(W/L)_p} - \frac{\Delta(W/L)_n}{(W/L)_n}\right) \tag{8}
$$

and 

$$
V_{OS,v} \approx V_T \left(\frac{\Delta(W/L)_p}{(W/L)_p}
+ (V_{TH3} - V_{TH4})\sqrt{\frac{\mu_p C_{ox} (W/L)_p (1 + |\lambda_p V_{DSp}|)}{I_{TAIL}/4}}
- \frac{\Delta I_S}{I_S}\right). \tag{9}
$$

Here V_TH is the threshold voltage, (W/L)_n = (W1/L1 + 
W2/L2)/2 is the combined W/L of M1 and M2, (W/L)_p = 
(W3/L3 + W4/L4)/2 is that of M3 and M4, λ is the channel 
length modulation coefficient, V_DSn is the drain-source voltage 
of M1 and M2, V_DSp is the drain-source voltage of M3 and 
M4, Δ(W/L)_n = W1/L1 − W2/L2, Δ(W/L)_p = W3/L3 − 
W4/L4, μ_n is the mobility of electrons, μ_p is the mobility of 
holes, V_T is the thermal voltage, I_S = (I_S1 + I_S2)/2, ΔI_S = 
I_S1 − I_S2, I_S1 is the scale current of Q1, and I_S2 is the scale 
current of Q2. Note that (9) is derived here following a procedure 
similar to that for (8). Because the effect of V_T in (9) is scaled 
by Δ(W/L)_p/(W/L)_p, it follows from (8) and (9) that V_OS,v is 
much smaller than V_OS,n. 

Fig. 12. The dc offset voltage of RF mixer using V-NPN switching pair versus 
input power level. (The data indicated by an error bar is the range of the dc offset 
measured from NMOS mixer fabricated using same CMOS technology). 

The chip photograph of the fabricated V-NPN op amp is 
shown in Fig. 14. Table II summarizes the performance of the 
CMOS op amp and the V-NPN op amp. The V-NPN op amp has 
a voltage gain of 58 dB, an equivalent input noise voltage (V_n) 
of 2.9 nV/√Hz with a 1/f corner frequency (f_c) of 1.9 kHz, 
and an equivalent input noise current of 0.7 pA/√Hz with an f_c of 
1.8 kHz. In particular, the V-NPN op amp has a two orders of magnitude 
lower f_c and a smaller V_n² than the CMOS one at the same current. 
Furthermore, its input offset voltage is about 1 mV, which is 
much smaller than that in CMOS. The input base current of each 
transistor of the V-NPN differential pair is 1.54 µA. The input 
offset current of the V-NPN differential pair is measured as 
about 5 nA using an HP 4142B. Since V-NPN device-to-device 
matching is excellent, the impact of input offset current is 
negligible. 


VI. WAYS TO INCREASE OPERATING FREQUENCY OF V-NPN 


As stated above, the RF mixer and operational 
amplifier using V-NPN are much more robust against 
low-frequency noise and mismatch, both of which are vital to DCR. For 
example, the utilization of V-NPN as shown in Fig. 15 makes 
a high-performance CMOS DCR possible. Also, by combining 
V-NPN and MOSFET devices on the same chip, we can 
optimize the analog/digital circuits and the tradeoff 
between speed and power. Therefore, V-NPN can have an impact on 
the implementation of high-performance CMOS DCR as well 
as system-on-a-chip. 
Fig. 13. Circuit schematic diagram of (a) one-stage CMOS operational amplifier and (b) one-stage V-NPN operational amplifier. 


TABLE II 
PERFORMANCE SUMMARIES OF CMOS OPERATIONAL AMPLIFIER AND V-NPN OPERATIONAL AMPLIFIER 

                              CMOS operational amplifier   V-NPN operational amplifier
                              (simulation)
Voltage gain                  49 dB                        58 dB
V_n @ 1 kHz                   41.5 nV/√Hz                  4.3 nV/√Hz
1/f corner frequency (V_n)    310 kHz                      1.9 kHz
V_n (at midband)              4.6 nV/√Hz                   2.9 nV/√Hz
Input offset voltage          -                            < 1 mV
I_n @ 1 kHz                   -                            1.1 pA/√Hz
1/f corner frequency (I_n)    -                            1.8 kHz
I_n (at midband)              -                            0.7 pA/√Hz
Input bias current            -                            1.54 µA
Input offset current          -                            5 nA
Power consumption             120 µA @ 1.8 V               128 µA @ 1.8 V
Fig. 14. Chip photograph of V-NPN operational amplifier. 


However, the current V-NPN circuit has very limited RF 
performance because its f_T is an order of magnitude lower than that 
of the MOSFET. Because of its low f_T, it is difficult to apply V-NPN to 
higher frequency circuits. In this paper, we propose two ways 
to increase its operating frequency. One is a simple fabrication 
process change and the other is a receiver architecture change. 





Fig. 15. The impact of V-NPN for single-chip radio (mixer using V-NPN; BBA using MOSFET or V-NPN). 


Fig. 16 shows how a thin base width can be obtained in two ways. 
One is to use a separate shallower p-well implant and the other 
is to use a shallower deep n-well implant. To validate 
this simply, a V-NPN with four emitter fingers was simulated 
using Athena and Atlas [23]. We followed the same process 
steps as in [24]. Fig. 16(a) shows the simulated cross view and 
Fig. 16(b) plots the two-dimensional (2-D) net doping profile of 
the V-NPN through the cutting-plane line A in Fig. 16(a). The 
f_T versus base width, with the peak base doping kept constant at 
5 × 10¹⁷/cm³, is shown in Fig. 17(a) before collector-to-emitter 
punchthrough at V_CE = 1 V. Fig. 17(b) shows how the f_T of the V-NPN 
can also be improved by changing the deep n-well implantation 
energy before pinch-off at V_CE = 1 V. As can be seen, more 
than 10 GHz of f_T can be readily obtained with one additional 
process step. 

Fig. 17. (a) f_T versus the base width of V-NPN and (b) f_T versus deep n-well 
implantation energy. 

The second method is to change the receiver architecture, 
that is, to adopt the dual-conversion receiver [25] as shown in 
Fig. 18. The advantages of the dual-conversion receiver compared 
with DCR are as follows: no RF channel-select frequency synthesizer 
is required, there is more design flexibility (for example, providing 
gain at the IF stages), dc offset is smaller, LO pulling is weak, and 
LO leakage is low. However, the dual-conversion receiver has the 
disadvantages that the additional mixers add power consumption, noise, 
and distortion, the image rejection filter increases the die area, and image 
rejection is limited by gain matching and LO deviation from 
quadrature [26]. Because the second mixer and following BBA 
circuits of the dual-conversion receiver process the baseband 
signal, the 1/f noise and dc offset characteristics of these blocks 
have a considerable influence on the baseband signal. Therefore, 
if the LNA and first mixer are implemented using MOSFET 
devices with high f_T and the second mixer and following BBA 
circuits are implemented with a combination of V-NPN and 
MOSFET, the operating frequency can be greatly extended while 
exploiting all the advantages of V-NPN circuits. In the same way, 
this can be applied to the Weaver DCR [26], as in Fig. 19, which has 
image rejection capability through the self-aligning 
image-rejection mixer. Therefore, the appropriate use of V-NPN and MOSFET 
in the dual-conversion receiver and Weaver DCR can extend the 
operating frequency of DCR with all the inherent advantages of 
V-NPN DCR. 

Fig. 16. (a) Simulated cross view and (b) 2-D net doping profile of V-NPN for deep n-well implantation dose of 2 x 10% 

Fig. 18. Dual conversion receiver adopting V-NPN. Note that dual conversion 
high-IF receiver allows on-chip image rejection filter implementation in CMOS 
[28], [29]. 
Fig. 19. Weaver DCR adopting V-NPN. 


VII. CONCLUSION 


We have presented the electrical characteristics of the V-NPN 
available in deep n-well 0.18-µm CMOS technology. A 
double-balanced RF mixer using V-NPN is almost free of 
1/f noise and has an order of magnitude smaller dc offset, 
with other characteristics comparable to those of the CMOS one and 
12 dB of flat gain up to a frequency higher than the current 
cutoff frequency of the V-NPN transistor itself. The V-NPN 
operational amplifier for BBA circuits has higher voltage gain, 
better noise performance, and better matching than the CMOS 
one at the same current. These circuits using V-NPN can have 
a great impact on the implementation of high-performance 
direct-conversion receivers in CMOS technology. With 
further scaling of CMOS, and/or one additional base implant 
process step, and/or the adoption of the dual-conversion 
architecture and Weaver DCR, very high-performance DCRs 
comparable to those obtained from pure bipolar or BiCMOS 
can be fabricated in low-cost CMOS technology. 
Highly Integrated Direct Conversion Receiver for 
GSM/GPRS/EDGE With On-Chip 84-dB Dynamic 
Range Continuous-Time ΣΔ ADC 


Yann Le Guillou, Olivier Gaborieau, Patrice Gamand, Martin Isberg, Peter Jakobsson, Lars Jonsson, David Le Déaut, 
Hervé Marie, Sven Mattisson, Laurent Monge, Torbjörn Olsson, Sébastien Prouet, and Tobias Tired 


Abstract—This paper describes a highly digitized direct 
conversion receiver of a single-chip quadruple-band RF transceiver 
that meets GSM/GPRS and EDGE requirements. The chip uses 
an advanced 0.25-µm BiCMOS technology. The I and Q on-chip 
fifth-order single-bit continuous-time sigma-delta (ΣΔ) ADC 
has 84-dB dynamic range over a total bandwidth of ±135 kHz for 
an active area of 0.4 mm². Hence, most of the channel filtering 
is realized in a CMOS IC where digital processing is achieved at 
a lower cost. The systematic analysis of dc offset at each stage of 
the design makes it possible to perform the dc offset cancellation 
loop in the digital domain as well. The receiver operates at 2.7 V with a 
current consumption of 75 mA. A first-order substrate coupling 
analysis is used to optimize the floor plan strategy. As a result, the 
receiver has an area of 1.8 mm². 


Index Terms—Analog-to-digital conversion, BiCMOS, continuous 
time, dc offset, direct conversion, EDGE, front-end, 
GPRS, GSM, IIP2, low-noise amplifier (LNA), mixer, self-mixing, 
sigma-delta (ΣΔ). 


I. INTRODUCTION 


THE global system for mobile communication (GSM) 
launched the second-generation (2G) system for cellular 
communication on a worldwide market. Today, the trend is to 
increase the number of data applications and the data rates. 
Enhanced data rates for GSM evolution (EDGE), a 2.75G 
system, triples the GSM data rate by going from Gaussian 
minimum shift keying (GMSK) with 1 bit per symbol to an 8-level 
phase shift keying constellation (8-PSK) with 3 bits per symbol. 
It uses the GSM infrastructure and has the same symbol rate 
of 270 kS/s. To keep 2.75G system solutions cost-effective, 
the bill of materials (BOM) must be reduced as well as the 
power consumption. From this perspective, a direct-conversion 
receiver (DCR) is a very attractive architecture [1]-[4]. It 
eliminates the need for both IF and image-reject filtering and 
requires only a single oscillator (LO), as illustrated in Fig. 1. 
With a high dynamic range ADC, analog gain control (AGC) 
can be significantly reduced and most of the selectivity can 
be achieved in the digital baseband processor. With the 
high dynamic range ADC integrated on the RF IC, the CMOS baseband 
becomes purely digital. It can then take advantage of CMOS 
process shrinking to reduce the overall power consumption and 
cost over the generations. 

This work presents a DCR with an on-chip 13.5-bit resolution 
ΣΔ ADC over a bandwidth of ±135 kHz. Section II focuses on 
the DCR design and the techniques used to address the well-known 
weaknesses of DCR such as LO leakage, self-mixing, the 
finite second-order intercept point (IIP2), etc. [4], [5]. A brief 
description of the 0.25-µm BiCMOS technology, together with a 
first-order substrate coupling analysis, is provided in Section III. 
Experimental results obtained from silicon implementation are 
presented in Section IV. Finally, in Section V, conclusions are 
drawn. 
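
The 84-dB dynamic range and the 13.5-bit resolution quoted above can be related through the usual ideal-quantizer rule of thumb, DR ≈ 6.02N + 1.76 dB for a full-scale sine wave. That relation is a textbook approximation and is an assumption here, not a formula given in the paper; the short sketch below only shows that the two figures are roughly consistent.

```python
def ideal_dynamic_range_db(bits: float) -> float:
    """Ideal sine-wave dynamic range of an N-bit quantizer: 6.02*N + 1.76 dB."""
    return 6.02 * bits + 1.76


def effective_bits(dynamic_range_db: float) -> float:
    """Invert the rule of thumb to get an effective resolution in bits."""
    return (dynamic_range_db - 1.76) / 6.02


print(f"13.5 bits -> {ideal_dynamic_range_db(13.5):.1f} dB")
print(f"84 dB     -> {effective_bits(84.0):.2f} bits")
```

Under this approximation 13.5 bits corresponds to about 83 dB, and 84 dB corresponds to about 13.7 bits.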





II. CIRCUIT DESIGN 


The quad-band DCR is shown in Fig. 2. The low-band 
(LB) term is used for the GSM900 and GSM850 systems 
(880-960 MHz) and the high-band (HB) is for the DCS1800 
and PCS1900 systems (1805-1990 MHz). 

The low-noise amplifiers (LNAs) consist of four differential 
transconductors recombined through a cascode stage into one 
common resistive load for each band. The RF outputs of the 
LNAs are ac coupled to the in-phase (I) and quadrature-phase 
(Q) mixers so that the low-frequency distortion generated by the 
second-order nonlinearities in the LNA is blocked to prevent 
leakage through the mixer. The multiplier cells of the mixers 
use a 1/2 or 1/4 sub-harmonic LO signal when the high band or 
low band is selected, respectively [6], [7]. This LO 
configuration ensures sufficient frequency separation between the VCO 
frequency and the largest received blockers and their harmonics. 
It avoids VCO pulling and the associated LO phase noise 
degradation that would degrade the sensitivity in the 
presence of interferers. The baseband chain includes a third-order 
low-pass filter that prevents the interferers from saturating the 
high dynamic range 1-bit continuous-time fifth-order ΣΔ ADC. 
The bit stream coming from the ADC drives a low-voltage 
slew-rate-controlled digital output buffer. 


A. Low-Noise Amplifier (LNA) 


Usually, to achieve the best compromise between gain, 
linearity, noise, and input matching, emitter degeneration is 
provided by an inductance [8]. The requirement to combine an 
extremely low noise figure (NF) LNA with a small area tends to relax 
the inductive degeneration. Hence, the differential 
transconductance of each LNA is partly degenerated with an inductor and 
partly with an ac shunt feedback between the input and the 
output (see Fig. 3) [9]. However, in the ac shunt feedback 
configuration, the parallel input impedance depends on the feedback 
impedance Z_F and the resistive load R_L. Then, to decrease the 
influence of the resistive load without increasing the area of the 
LNA, the parallel input impedance of 150 Ω is achieved half by 
ac shunt feedback and half by inductive degeneration 
feedback. After parasitic extraction, the simulated return loss of 
this LNA is better than −22 dB in all bands. The NF is 2.2 dB, 
whereas the gain and the ICP1 have been simulated as 
25 dB and −21 dBm, respectively, for a current consumption of 8.7 mA. 

Fig. 1. Direct-conversion receiver architecture. 
Fig. 2. Block diagram of the implemented quad-band DCR. 


B. Mixer 


The I/Q direct-conversion mixers use a double-balanced
Gilbert-type topology, as shown in Fig. 4. This topology
provides inherently high IIP2 [4], [6], [7], [10], [11]. The
resistive 100-Ω degeneration ($R_E$) and the 7.8-mA current
consumption have been chosen to trade off NF, IIP3, and input
impedance. Since the 1/f noise spectrum of the mixers falls on
top of the desired signal at baseband, only small NPN bipolar
transistors with 40-GHz $f_T$ were used in the mixer design to
reduce the effect of flicker noise at the mixer output [11].
Transistor Q1 (Q2) drives the Q5-Q6 (Q3-Q4) and Q9-Q10 (Q7-Q8)
switch core transistors of BBI and BBQ, respectively. When
properly scaled, the matching of the currents $I_1$, $I_2$, $I_3$,
and $I_4$ relies on the area of the $R_E$ resistors, while the
switch core transistors Q3-Q10 can remain small. As a result, the
parasitic capacitors are small. Thus, the LO transitions are sharp
and the random modulation of the switching time instants is small.
Consequently, the flicker noise at the mixer output is minimized [11].

Fig. 3. Circuit diagram of the LNA.

$R_E$ and $R_L$ are boron-doped polysilicon resistors. This type
of resistor has good matching and flicker noise performance
[12] as well as precise 1/f noise modeling derived from
experimental measurements.

A common-centroid structure for the mixer core and the
mixer load $R_L$ is required to compensate for thermal gradient
effects and component mismatch that could result in gain and
phase imbalances. The RF and LO lines cross perpendicularly
to reduce RF self-mixing.

Fig. 4. Circuit diagram of the mixer.


C. LO Divider and Buffer 


The VCO runs at two (HB) or four (LB) times the RF
frequency, which prevents VCO pulling [7]. The required LO
dividers by 2 or 4 generate the I and Q quadrature signals.

As illustrated in Fig. 5, the fast divider by two uses a classical
differential bipolar structure composed of master and slave
latches. The 50-Ω $R_L$ resistance is a compromise between I and
Q matching, noise, and linearity requirements. Extensive Monte
Carlo simulations after parasitic extraction led to a tail current
($I_1$) of 4 mA to achieve an I/Q phase error lower than 1°.
Typically, the slope of the LO signal edges is as sharp as
6 GV/s, which minimizes the flicker noise at the mixer
output [11].

The LO signal path is 90° shifted from the RF signal path (see 
Fig. 13) to minimize the magnetic coupling between LNA and 
LO circuits. 


D. Baseband Filter Circuit 


After downconversion and prior to digitization by the ADC, 
the baseband filter (BBF) completes the receiver chain. The 
BBF circuit reduces the dynamic range requirement on the ADC
through two mechanisms: first, by amplifying the wanted signal
above the noise floor of the ADC, and
second, by filtering the undesired signals—adjacent channels, 
blockers—so that they do not overload the ADC. As shown in 
Fig. 6, a third-order filter is sufficient to attenuate the worst 
blocker case, which is at 3 MHz. Hence, most of the channel 
filtering is performed in the digital baseband processor. 

Fig. 5. Circuit diagram of the LO divider.

A first real pole is conveniently realized at the mixer output:
its location early in the receiver chain alleviates the IP2 and IP3
requirements of the following stages. The impedance is lower at the
first stage of the baseband filter and is scaled along the path to
optimize the noise and the die area. As a result, the Sallen & Key
stage, which is a complex-pole filter, is introduced after
17 dB of gain in BB1. The global BBF amplification is 25 dB.
Consequently, a 5-mV dc offset at the mixer output results in an
89-mV dc offset at the BBF output. Hence, provided that the dc
offset does not overload the ADC, the dc offset cancellation can
be fully achieved in the digital baseband processor.
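
The offset figure quoted above follows from converting the 25-dB voltage gain to a linear ratio. A minimal Python sketch of that arithmetic is given below; the function name is illustrative and does not come from the paper.

```python
def output_offset_mv(input_offset_mv: float, gain_db: float) -> float:
    """Scale a dc offset by a voltage gain expressed in dB (20*log10 convention)."""
    return input_offset_mv * 10 ** (gain_db / 20)


# 5-mV offset at the mixer output, amplified by the 25-dB BBF gain
bbf_offset_mv = output_offset_mv(5.0, 25.0)
print(f"{bbf_offset_mv:.1f} mV")
```

The result rounds to the 89 mV stated in the text.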

The unity gain buffer of the Sallen & Key stage, BB2, is intro- 
duced in the feedback path [13]. If located in the forward path, 
the buffer output impedance, together with feedback capacitor 
would build a parasitic zero, thus enlarging the out of band gain. 
The next amplifier, BB3, provides some gain trimming, and the 
final one, BB4, is designed to interface with the ADC. 

The nominal 3-dB bandwidth of the BBF circuit is 208 kHz,
while the EDGE requirement is 135 kHz. This allows more
than 35% spread for process and temperature variations without
violating the EDGE requirement. In addition, the group delay
variation is below 0.18 µs even when the 3-dB bandwidth is
at 135 kHz. Consequently, the BBF circuit does not need
additional tuning circuitry to compensate for process and
temperature variations.
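
The 35% spread can be checked as the relative distance between the nominal and the required bandwidth. The sketch below assumes the margin is referred to the nominal 208-kHz value, which is an interpretation rather than something the paper states; the function name is illustrative.

```python
def bandwidth_margin(nominal_khz: float, required_khz: float) -> float:
    """Fractional downward spread tolerated before the requirement is violated."""
    return (nominal_khz - required_khz) / nominal_khz


margin = bandwidth_margin(208.0, 135.0)
print(f"{margin:.1%}")
```

Under that assumption the margin comes out slightly above 35%, in line with the text.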
Fig. 6. Block diagram of one channel of the baseband filter (BBF).
Fig. 7. ADC dynamic range requirement.

The total 2.5-nF capacitance for the I and Q BBF has been
stacked on the BBF active part to save area. As a result, the
BBF occupies 0.5 mm² for a current consumption of 13 mA.


E. Fifth-Order Continuous-Time ΣΔ ADC Circuit


GSM/GPRS/EDGE requires the reception of signals between 
—104 dBm and —15 dBm [14]. The state-of-the-art sensitivity 
of —109 dBm is the target. The specified 10~° bit-error rate 
(BER) requires a 7-dB signal-to-noise ratio (SNR) and thus 
leads to the system noise floor of —116 dBm. The ADC should 
not be the dominant noise source for a power-efficient
implementation. As illustrated in Fig. 7, its noise floor is 8 dB
below that of the analog front-end. If the out-of-band interferers
are filtered so that they do not overload the ADC, the dynamic range
requirement is then reduced to 84 dB and can be handled with
a fifth-order ΣΔ ADC [15]. A low-pass continuous-time (CT)
ΣΔ ADC is desirable since it enables a low-power implementation
without the need for an anti-aliasing filter [16]-[19].
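
The noise budget described above can be written out step by step. The sketch below uses only the figures quoted in the text (-109 dBm sensitivity, 7-dB SNR, 8-dB ADC margin, 84-dB dynamic range). The last value it computes, the largest antenna-referred level the ADC can handle, is an inference that assumes the dynamic range is counted upward from the ADC noise floor; it is not a number given in the paper.

```python
def adc_budget(sensitivity_dbm, snr_db, adc_margin_db, adc_dr_db):
    """Return (system noise floor, ADC noise floor, max ADC-handled level) in dBm."""
    system_floor = sensitivity_dbm - snr_db      # noise floor set by the required SNR
    adc_floor = system_floor - adc_margin_db     # ADC noise kept below the front-end noise
    max_level = adc_floor + adc_dr_db            # assumed: DR counted from the ADC noise floor
    return system_floor, adc_floor, max_level


system_floor, adc_floor, max_level = adc_budget(-109.0, 7.0, 8.0, 84.0)
print(system_floor, adc_floor, max_level)
```

The first value reproduces the -116-dBm system noise floor stated in the text; the other two are derived quantities under the stated assumption.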

Fig. 8 shows the block diagram of the implemented ADC,
derived from [16]. The fifth-order loop filter has two complex
conjugate poles introduced by the local feedback coefficients $b_1$
and $b_2$. They appear as notches in the shaped quantization noise
(see Fig. 9). One of the notches is located at 78-kHz offset
frequency. The other one is at the edge of the signal band. The
feedforward coefficients $a_i$ provide first-order roll-off at
open-loop unity gain for stability reasons. Large-signal stability
is achieved by clipping the integrator outputs, starting at the
fifth integrator [18]. The input stage of the ADC consists of an
operational transconductance amplifier in an integrating feedback
configuration. The rest of the loop filter is implemented by means of
transconductor-C ($g_m$-C) integrators for low-power reasons.
The 1-bit feedback DAC is inherently linear. It switches resistors
between positive and negative reference voltages derived from
an on-chip bandgap reference circuit. A return-to-zero (RTZ)
coding scheme is used to minimize the inter-symbol interference
[20]. The biasing technique used for the design of
temperature-insensitive $g_m$-C integrators avoids the need for
tuning circuitry [19].

Fig. 8. Block diagram of the fifth-order CT ΣΔ modulator.

Fig. 9. Simulated output spectrum of the fifth-order CT ΣΔ modulator.
[Plot: magnitude (dB) versus f (MHz); RBW = 800 Hz; annotated SNR = 93.17 dB.]
Fig. 10. Block diagram of the slew-rate limited output buffer.


As illustrated in Fig. 9, the simulated quantization noise is
93 dB below the maximum input signal at the modulator input,
which is 3 dB below the overload level (-3 dBFS).
This allows a ±150-mV dc offset at the modulator input before
overloading. Therefore, the dc offset cancellation can be fully
achieved in the digital domain. The resistive DAC and the input
stage of the ADC limit the SNR to 84 dB. This SNR is further
degraded by 1.5 dB when the jitter of the 13-MHz clock is
considered.


F. Slew-Rate Limited Output Buffer 


The digital output buffers are used to drive the output pins 
with the 13-MHz bit-stream signals coming out of the ADC. 
Note that having a single-bit ADC limits the required
number of buffers to three (I, Q, and clock) and thus saves power
consumption and silicon area.

The digital output signal is shaped so as to limit the harmonic
levels that might couple to the input RF signal. For this purpose,
a slew-rate limited buffer fed with a 1.5-V internally regulated
$V_{REF}$ is designed according to Fig. 10. The voltage regulation
is performed by a classical series regulator together with a
100-pF MIM decoupling capacitor integrated over the entire
block. The slew-rate limited buffer is made of two inverters
in parallel feeding the power output PMOS and NMOS transistors
with the digital signal. These inverters allow the buffer to be
put in tri-state mode by forcing the gates of the output PMOS
and NMOS to $V_{CC}$ and GND, respectively. They are sized to
drive the output transistors so as to avoid direct current
feedthrough from the output PMOS (NMOS) to the output
NMOS (PMOS) during transitions. Hence, all the current is
delivered to the load. A capacitor is added to the inverter outputs
to help slow down the current steered from $V_{REF}$ (GND)
through the output PMOS (NMOS). This helps limit the
high-frequency harmonic levels on the output signal and avoid
large voltage spikes on the supply and ground rails. Finally, the
13-MHz clock frequency is slow enough to avoid high
electromagnetic coupling with any nearby bond wires connected to
sensitive blocks.





III. PROCESS IMPLEMENTATION AND FLOORPLAN STRATEGY
A. Process Implementation 


The quad-band receiver (as part of a fully integrated
transceiver) has been fabricated in a mature RF 0.25-µm BiCMOS
technology [21]. This technology features 40-GHz $f_T$ and
90-GHz $f_{max}$ NPN devices combined with high-quality passives
and has been optimized for high-frequency, low-noise,
and low-supply-current applications. Special effort has been put
on the quality of passive components, such as matching, quality
factor, and deep trench isolation (DTI).

Low current consumption is achieved by optimizing the $f_T$
versus the collector current and by using the deep trench technique
to reduce collector-substrate capacitance. For instance, this
provides only 150-µA/µm² current density for 25-GHz $f_T$ and less
than 9-fF collector-substrate capacitance for a 0.4 × 20 µm²
device.

Diffused and polysilicon resistors with less than 0.6%-µm
and 2.8%-µm respective matching performance are particularly
adequate for architectures where dc offset as well as I and Q
mismatch need to be minimized.

Low-k dielectric, thick metal, and DTI allow an on-chip inductor
Q as high as 20 at 2 GHz for a 1.5-nH coil.

The backend of this BiCMOS technology has been optimized
to allow high routing density. It includes an embedded
high-density 5-nF/mm² MIM capacitor built close to the top metal
levels, which ensures very low parasitic elements to the substrate.
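
As a rough consistency check on figures quoted in this paper, the 2.5-nF baseband filter capacitance (Section II-D) can be divided by the 5-nF/mm² MIM density. The sketch below is illustrative only: the function name is not from the paper, and it assumes the whole capacitance is realized with the MIM capacitor.

```python
def mim_area_mm2(capacitance_nf: float, density_nf_per_mm2: float) -> float:
    """Area needed to realize a capacitance at a given MIM capacitor density."""
    return capacitance_nf / density_nf_per_mm2


bbf_cap_area = mim_area_mm2(2.5, 5.0)
print(f"{bbf_cap_area:.2f} mm^2")
```

Under that assumption the capacitor footprint is comparable to the 0.5-mm² BBF area quoted in Section II-D, which is consistent with the capacitance being stacked on the active part.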


B. Floor Plan Strategy 


A high level of function integration increases the sensitivity 
of the circuit to crosstalk. Sources of interferences are related to 
digital to analog coupling, electromagnetic (EM) coupling be- 
tween inductors, routing traces coupling, interconnections, and 
signal injection through the substrate. Several tools exist to esti- 
mate these effects but they require a huge computation time and 
therefore do not allow fast design/layout iterations. 

The methodology we have put in place is based on a simple
model that “simulates” point-to-point effects due to EM or
substrate coupling as a function of the distance, substrate
resistivity, and frequency. The model consists of a “black box”
H(jω), which is connected between two circuit blocks where crosstalk
has to be analyzed. The transfer function H(jω) is built with
scalable R and C elements and is based on empirical equations
[22] validated by a full-wave analysis.

The advantage of this concept is that sensitivities are detected
as early as possible during the design phase, so that potential
risks can be anticipated. At this stage of the design, high
accuracy is not necessarily required. This method then gives good
indications for an optimized floor plan.

For instance, the effect of the output bit-stream of the ADC on
the input of the LNA has been considered. Indeed, the voltage
swing at the ADC output is close to 1.5 V peak, which might
severely disturb the input signal of the LNA and degrade the
sensitivity. We have shown that for a distance of 1 mm between
the ADC output and the RF LNA inputs, the voltage amplitude
at the ADC output is reduced to -125 dBV at the LNA
input. Therefore, this effect is negligible. This method is used
to optimize layout guidelines with respect to critical function
performances. The isolation criterion is defined according to the
maximum spurious level that can be tolerated between two circuit
blocks.

The methodology has been completed by adequate measures 
to reduce interferences. Specific EM software [23], which can 
take into account heterogeneous structures, has been used to 
find the optimum combination of deep trenches and guard rings 
to improve the overall isolation between blocks. In particular, 
adding a deep trench to a guard ring increases the isolation by 
5-8 dB at 1 GHz depending on the distance. We should remark 
that ac coupling through the substrate depends on its resistivity. 
For ac decoupling of the supply lines, we have extensively used 
the two top metal layers of the process with the embedded MIM 
capacitor to provide a good decoupling characteristic without 
any silicon area penalty. 

Thanks to this methodology, we have reached a very com- 
pact layout without compromising the performance of the trans- 
ceiver. The silicon area used for the receiver part (from the LNAs 
to the output bit-stream of the ADC) is only 1.8 mm².


IV. EXPERIMENTAL RESULTS 


The measured analog front-end receiver (without the ADC) 
NF is 2.3 dB and its IIP3 is -9 dBm, which is in agreement with
the simulated results (see Section II). The dc offset is typically 
below 3.5 mV in all bands at the BBF output. 

Fig. 11. Measured ΣΔ ADC performances as a function of input signal level.

TABLE I
RECEIVER PERFORMANCES

Bands (MHz)                                | 850         | 900         | 1800        | 1900
NF (dB)                                    | [illegible] | [illegible] | [illegible] | [illegible]
LO re-radiation (dBm)                      | -112        | -118        | -130        | -111
AM suppression (dBm)                       | >-24        | >-24        | >-25        | >-24.5
Required AM suppression (dBm) [25]         | -31         | -31         | -31         | -31
Blocker level @ 3 MHz (dBm)                | >-20        | >-20        | >-20        | >-21
Required blocker level @ 3 MHz (dBm) [25]  | -23         | -23         | -26         | -26
Typical IQ phase match (°)                 | <1          | <1          | <1          | <1
Typical IQ amplitude match (dB)            | 0.07        | [illegible] | [illegible] | 0.15
Typical sensitivity (dBm)                  | -109        | -109        | -108        | -108
ADC dynamic range (dB)                     | 84          | 84          | 84          | 84
3-dB baseband BW (kHz)                     | 208         | 208         | 208         | 208
Group delay to 100 kHz (µs)                | <0.18       | <0.18       | <0.18       | <0.18
Rejection at 3 MHz (dB)                    | >70         | >70         | >70         | >70
Power (incl. synth) (mA)                   | 75          | 75          | 75          | 75

The measured SNR and the signal-to-noise-and-distortion
ratio (SNDR) of a single ΣΔ modulator are plotted in Fig. 11.
The peak SNDR is 81.8 dB and the peak SNR is 82.5 dB in a 135-kHz
bandwidth for a single modulator. This corresponds to an effective
number of bits (ENOB) of 13.5. The dynamic range (DR)
is 84 dB. The IM2 and IM3 distances are 95 dB and 93 dB,
respectively. Since the modulator input is limited at -2.87 dBFS,
this leaves enough margin to avoid saturation. The total current
consumption of ADC I and Q, including the biasing, is 2.8 mA
under 2.5 V. The resolution and signal bandwidth correspond
to a figure of merit of $P/(2^{ENOB} \cdot BW) = 2.2$ pJ/conversion,
which is equal to that of [24]. In this work, the power consumption
figure includes the I and Q ΣΔ modulators, the biasing
circuitry, and the delay-locked loop (DLL). The DLL generates
the different 13-MHz clock phase shifts necessary for the RTZ
clock scheme implementation. The ADC has been tested over
the temperature range [-30 °C, +85 °C] and over the voltage
range [2.2 V, 3.5 V] without observing any degradation in the
linearity.
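
The 2.2-pJ/conversion figure of merit quoted for the ADC can be reproduced from the measured numbers. The sketch below assumes that the power entering the formula is that of a single modulator, taken as half of the 2.8 mA × 2.5 V total; this split is an assumption chosen because it matches the quoted result, not a statement from the paper. The function name is illustrative.

```python
def adc_fom_pj(power_w: float, enob_bits: float, bandwidth_hz: float) -> float:
    """Figure of merit P / (2**ENOB * BW), returned in pJ per conversion."""
    return power_w / (2 ** enob_bits * bandwidth_hz) * 1e12


total_power_w = 2.8e-3 * 2.5                 # I and Q ADCs, including biasing
per_modulator_power_w = total_power_w / 2    # assumed even split between I and Q
fom_pj = adc_fom_pj(per_modulator_power_w, 13.5, 135e3)
print(f"{fom_pj:.2f} pJ/conversion")
```

With these inputs the result is close to the 2.2 pJ/conversion stated in the text.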

As illustrated in Fig. 12, the average sensitivity at 23°C is 
-109, -108, and -108 dBm, respectively, for EGSM, DCS,
and PCS bands. In addition, the sensitivity is not degraded at 
13-MHz harmonics (dotted lines in Fig. 12), which validates the 
floor plan strategy detailed in Section III. 

The main receiver performances are presented in Table I.
The re-radiation of the LO signal measured at the LNA input
is below -110 dBm for all bands. The I/Q quadrature,
measured at the BBF output, is accurately generated, within 1°
and 0.2 dB. In addition, the worst-case 3-MHz blocker level for
a 2.4% class II RBER with a wanted signal at -98 dBm [25] can
be as high as -20 dBm for the GSM850/900 bands and -21 dBm
for the DCS1800/PCS1900 bands. This gives at least 3-dB margin
for GSM850/900 and 5-dB margin for DCS1800/PCS1900
compared to the 3-MHz blocking test requirement [25].
Consequently, the LO chain exhibits good phase noise performance.
The NF measured for the whole receiver at the ΣΔ ADC output is
below 3.1 dB for all bands. In the application, IP2 is verified by
measuring AM suppression performance as specified in [25].
A class II RBER of 2.4% in all bands is achieved even in the
presence of a -24-dBm GMSK-modulated interferer. This results in
7 dB of margin on the -31-dBm requirement [25].

The layout of the receiver is shown in Fig. 13. Its area is
1.8 mm².

Fig. 12. Sensitivity measurement results at 23 °C for EGSM (a), DCS (b), and PCS (c) bands. The dotted lines represent 13-MHz harmonics.

Fig. 13. Microphotograph of the receive path.


V. CONCLUSION


The on-chip low-pass continuous-time ΣΔ ADC achieves 84-dB
dynamic range over a bandwidth of ±135 kHz. Therefore, a
third-order baseband filter is sufficient to attenuate the
worst-case blocker at 3 MHz. As a result, most of the selectivity is
performed in the digital domain. In addition, the dc offset at
the mixer output is only amplified by 25 dB in the baseband filter
and does not overload the ADC. Consequently, the dc offset
cancellation is performed in the digital domain as well. The
baseband buffer and ADC circuits have been enhanced to
accommodate process and temperature variations. Consequently, no
calibration or tuning circuitry is necessary. Moreover, a
first-order substrate coupling analysis that optimizes the floor
plan strategy with respect to area and crosstalk has been presented
and validated, since no degradation of sensitivity performance
has been observed at 13-MHz harmonics. The presented quad-band
GSM/GPRS/EDGE direct-conversion multimode receiver
with on-chip ΣΔ ADC consumes 75 mA under 2.7 V for an
area of 1.8 mm², making it suited to the 2.75G system solution
requirements.
An Adaptive ENG Amplifier for
Tripolar Cuff Electrodes 
Abstract—Electroneurogram (ENG) recording from tripolar
cuff electrodes is affected by interference signals, mostly
generated by muscles nearby. Interference reduction may be achieved
by suitably designed amplifiers such as the true-tripole and
quasi-tripole systems. However, in practice their performance is
severely degraded by cuff imbalance, resulting in very low output
signal-to-interference ratios. Although some improvement may be
offered by post filtering, this considerably increases complexity,
size, and power dissipation, rendering the approach unsuitable for
the development of a high-performance ENG recording system
which is fully implantable. This paper describes an integrated,
fully implantable, adaptive ENG amplifier developed to
automatically compensate for cuff imbalance, and thus significantly
improve the quality of the recorded ENG. Measured results
show that the adaptive ENG amplifier has a yield of 100%, a
cuff imbalance correction range of more than ±40%, and an
output signal-to-interference ratio of about 2/1 (6 dB) even for
±40% imbalance. The latter should be compared with an input
signal-to-interference ratio of 1/500 (-54 dB). The circuit was
fabricated in 0.8-µm BiCMOS technology, has a core area of
0.68 mm², and dissipates 7.2 mW from ±2.5 V power supplies.
The adaptive ENG amplifier advances the state-of-the-art in
implantable tripolar nerve cuff electrode recording techniques.


Index Terms—Analog integrated circuits, cuff imbalance, ENG 
amplifier, implanted devices, tripolar cuff electrodes. 


I. INTRODUCTION 


ELECTRONEUROGRAM (ENG) recording techniques for
peripheral nerves using cuff electrodes offer a noninvasive
way of obtaining information regarding nerve operation
[1]. In the case of spinal cord injury this information can be 
used for the improvement of implanted devices used for reha- 
bilitation. Monitoring nerve operation allows some level of in- 
tervention by means of functional electrical stimulation for par- 
tial control of organs suffering from paralysis and for blockage 
of unwanted sensory and/or motor signals. Applications that 
have been investigated include the correction of foot-drop, stim- 
ulating hand-grasp, and controlling the urinary bladder after 
spinal cord injury [2]-[4]. 

Recording ENG effectively is not a trivial task, as the
microvolt-order (typically 1-5 µV) nerve signals are often obscured
by the millivolt-order (typically 1 mV) electromyogram (EMG)
from muscles nearby and by noise, notably white noise from
the interstitial fluid and from the electrode-tissue interface [5],
[6]. Furthermore, the spectra of the two signals overlap
considerably, making separation by means of filtering very
difficult [7]; the ENG has energy in the 500 Hz-10 kHz band
with maximum power around 1 kHz, while the EMG lies in the
1 Hz-3 kHz band and peaks at about 250 Hz [6]. Various ENG
amplifier configurations make use of the properties of the cuff 
electrodes, mainly the linearization of the EMG potential field 
inside the cuff [8]. Improved performance in terms of EMG re- 
duction is offered by tripolar cuffs (i.e., cuffs with three equally 
spaced ring electrodes embedded in the inside wall [1]). Due 
to this linearization, the EMG potential differences between the 
central electrode and the outer electrodes are equal and opposite 
and can be cancelled by a differential amplifier arrangement. 
By contrast, the ENG signal does not cancel in this way and 
can be recovered. The amplifiers used with tripolar cuffs are the 
quasi-tripole (QT) [1], [6] and true-tripole (TT) [9]. However, 
EMG reduction in these systems is affected by the departure of 
the cuff—tissue interface from its ideal model, caused by factors 
like cuff asymmetry and tissue growth inside it after implanta- 
tion, resulting in cuff imbalance as explained in more detail in 
Section II. 

To automatically compensate for the possible presence of cuff 
imbalance, and thus minimize EMG artifacts in nerve cuff elec- 
trode recording, an adaptive version of the TT, termed the adap- 
tive-tripole (AT), has been proposed [10] and its first integrated 
realization was reported in [11]. However, the realization in [11] 
showed poor performance in terms of output
signal-to-interference ratio (SIR),¹ harmonic distortion, cuff
imbalance correction range, and yield. This paper describes an
improved realization of the AT which overcomes all the limitations
of the first design. These enhancements were necessary in order to
make the system fully implantable for the targeted biomedical
application (i.e., bladder implant). The adaptive ENG amplifier to
be described has a chip yield of 100%, a cuff imbalance correction
range of more than ±40%, and an output SIR of no less
than 2 dB even for ±40% imbalance. The circuit was fabricated
in 0.8-µm BiCMOS technology, occupies 0.68 mm², and
dissipates 7.2 mW from ±2.5 V power supplies.

The remaining sections of this paper are organized as fol- 
lows. In Section II, the basic principles of ENG recording 
from tripolar cuff electrodes are briefly reviewed. Section III 
describes the AT architecture and examines the effect of phase 
errors on system performance. Section IV describes the circuit 
design of the various building blocks, while measured results 
are presented in Section V. Finally, conclusions are drawn in 
Section VI. 


¹SIR refers to the ratio of the peak amplitude of the ENG signal to that of
the EMG signal.
Fig. 1. Lumped-impedance model of the cuff and idealized ENG and EMG
potentials inside the cuff [6], [12]. Typical impedance values: $Z_{t0}$ = 200 Ω,
$Z_{t1,2}$ = 1.25 kΩ, $Z_{e1,2,3}$ = 1 kΩ.





(a) 


Fig. 2. Tripolar ENG amplifier configurations. (a) Quasi-tripole (QT). 
(b) True-tripole (TT). 


II. PRINCIPLES OF TRIPOLAR CUFF ELECTRODE RECORDING 


The ENG signal results from the action potentials propa- 
gating along the nerve fibers, which cause small action currents 
to flow through the fiber membranes into the extrafascicular 
medium [5]. Confinement within an insulating cuff causes the 
local impedance to be higher than outside the cuff, so that the 
action currents give rise to measurable potentials between the 
cuff electrodes. Simply stated, the nerve is an insulator, while 
the space between the nerve bundle and the cuff is filled with 
connective tissue and/or conducting fluid. 

A very important function of the cuff is that, as a uniform
insulating tube, it causes any externally applied potential
difference between its ends to produce a linear gradient inside [8].
This linearization effect is depicted in Fig. 1 in the basic
electrical lumped-impedance model of the cuff [6], [12]. In this
model, $Z_{t1}$ and $Z_{t2}$ represent the tissue impedances inside
the cuff, $Z_{t0}$ is the tissue impedance outside the cuff,
$Z_{e1}$, $Z_{e2}$, and $Z_{e3}$ are the electrode-tissue contact
impedances, $i_{EMG}(t)$ is the interfering EMG current that flows
inside the cuff, and $v_{ENG}(t)$ is the ENG voltage. At the
frequencies of interest, the impedances may be regarded as purely
resistive, with typical values listed in the caption of Fig. 1. The
EMG potentials across nodes ab and cb in Fig. 1 appear as anti-phase,
while the respective ENG potentials appear in-phase. Given the
linear gradient of the EMG potential inside the cuff and equally
spaced tripolar electrodes, the residual EMG at the output of
either the QT or the TT amplifier configuration (Fig. 2) will
ideally be zero. However, in practice $Z_{t1}$ and $Z_{t2}$ are
subject to uneven variations which destroy the tripolar cuff
symmetry, resulting in cuff imbalance, defined as

$$X_{imb} = \left(\frac{Z_{t1} - Z_{t2}}{Z_{t1} + Z_{t2}}\right) \times 100\%, \quad |X_{imb}| < 100\%. \qquad (1)$$

Fig. 3. Adaptive-tripole (AT) architecture.


The two main reasons for the variations in $Z_{t1}$ and $Z_{t2}$ are
inhomogeneous tissue growth inside the cuff after implantation
and manufacturing tolerances in the positioning of the electrodes
[10]. Secondary reasons affecting cuff imbalance include the
position of the EMG source relative to the cuff [13]. Although the
ENG signal recorded with the TT is about twice that recorded
with the QT, the TT is much more sensitive to mismatch in $Z_{t1}$
and $Z_{t2}$ than the QT. On the other hand, the QT, unlike the TT,
is very sensitive to mismatches in $Z_{e1}$, $Z_{e2}$, and $Z_{e3}$.
In the case of the TT, assuming unity gain for the output amplifier
($G_3 = 1$), the residual EMG at its output is [14]

$$V_{o(EMG)} = i_{EMG}(t)\,\frac{Z_{t0}\,(G_1 Z_{t1} - G_2 Z_{t2})}{Z_{t0} + Z_{t1} + Z_{t2}} \qquad (2)$$

where $G_1$ and $G_2$ are the gains of the input differential
amplifiers in Fig. 2(b). Note, however, that the right-hand
side of (2) can be made zero by adjusting $G_1$ and $G_2$ to
compensate for any mismatch between $Z_{t1}$ and $Z_{t2}$ (this
approach cannot be used with the QT). An automatic adjustment of
the two amplifier gains is realized by the AT, which is described in
Section III.
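
The gain-compensation argument around (2) can be illustrated numerically. The sketch below evaluates the residual EMG expression for a hypothetical imbalanced cuff; the impedance and current values are chosen for illustration only and are not taken from the paper, and the function name is an assumption.

```python
def tt_residual_emg(i_emg, z_t0, z_t1, z_t2, g1, g2):
    """Residual EMG at the true-tripole output for unity output-stage gain, per (2)."""
    return i_emg * z_t0 * (g1 * z_t1 - g2 * z_t2) / (z_t0 + z_t1 + z_t2)


# Hypothetical values: 1 uA of EMG current, imbalanced tissue impedances
i_emg, z_t0, z_t1, z_t2 = 1e-6, 200.0, 1500.0, 1000.0

equal_gains = tt_residual_emg(i_emg, z_t0, z_t1, z_t2, g1=1.0, g2=1.0)
# Scale G1 by Z_t2 / Z_t1 so that G1 * Z_t1 equals G2 * Z_t2
compensated = tt_residual_emg(i_emg, z_t0, z_t1, z_t2, g1=z_t2 / z_t1, g2=1.0)
print(equal_gains, compensated)
```

With equal gains the residual is nonzero, whereas the compensated case vanishes to within floating-point error, which is the condition the AT is designed to reach automatically.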


III. ADAPTIVE TRIPOLE ARCHITECTURE
A. System Description 


The block diagram of the AT implementation described
in this paper is shown in Fig. 3. The system consists of two
voltage preamplifiers, each with a fixed gain $A$, providing a very
low-noise interface with the cuff electrodes. The preamplifiers
are followed by two operational transconductance amplifiers
(OTAs) with variable gains $G_{m1}$ and $G_{m2}$, controlled by the
differential feedback currents $I_1(t)$ and $I_2(t)$. The control
stage operates by first obtaining the moduli of the currents at the
output of the variable-gain OTAs and applying them to a current
comparator to establish which is the larger. The comparator
voltage output is subsequently applied to a large time-constant
integrator which generates $I_1(t)$ and $I_2(t)$. The variable-gain
OTAs counterbalance the presence of cuff imbalance, ideally by
equalizing the amplitudes of the EMG signals at their outputs.
As a result, when the output signals of the OTAs are summed at
the input of the output-stage amplifier (gain $G_o$), the equal and
anti-phase EMG signals from the two channels are cancelled,
and the in-phase ENG signals are added and further amplified.
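
The comparator-plus-integrator control described above can be sketched as a simple discrete-time loop. This is a behavioral illustration under stated assumptions, not the circuit of the paper: the two EMG amplitudes are modeled as (1 + X)/2 and (1 - X)/2, the OTA gains as G(1 - c) and G(1 + c) with a dimensionless control variable c standing in for the feedback currents, and the integrator as a fixed-step accumulator. All names are hypothetical.

```python
def settle_gain_control(x_imb, step=1e-3, iterations=2000, g_mo=1.0):
    """Bang-bang loop: compare the EMG amplitudes after the OTAs, integrate the sign."""
    c = 0.0  # control variable; gains are g_mo*(1 - c) and g_mo*(1 + c)
    for _ in range(iterations):
        amp1 = g_mo * (1 - c) * (1 + x_imb) / 2  # channel 1 EMG amplitude
        amp2 = g_mo * (1 + c) * (1 - x_imb) / 2  # channel 2 EMG amplitude
        if amp1 > amp2:
            c += step   # channel 1 too large: lower its gain, raise channel 2
        elif amp1 < amp2:
            c -= step
    return c


c_final = settle_gain_control(0.4)
print(round(c_final, 2))
```

In this model the loop settles when c is within one step of the imbalance X, so the two EMG amplitudes are equalized and cancel when summed in anti-phase.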


B. Sensitivity to Phase Errors 


The AT achieves optimum artifact reduction when the EMG
terms at the inputs of the output-stage amplifier (Fig. 3) are
exactly anti-phase. However, the use of ac coupling² in the
preamplifiers for dc offset cancellation (see Section IV-A) will,
in the case of mismatched filters, introduce additional phase
shifts between the composite signals $V_1(t)$ and $V_2(t)$ in
Fig. 3. The phase shifts will be more pronounced on the EMG, as its
frequency spectrum peaks at much lower frequencies than that of the
ENG [6]. Even if there is some phase shift between the ENG
terms of $V_1(t)$ and $V_2(t)$, it will still be possible to detect
neural activity in the relevant nerve bundles.

Based on the above, it is desirable to establish the maximum
tolerable phase mismatch between two first-order RC high-pass
filters to achieve an output SIR of no less than unity. Assuming
sinusoidal signals and a phase shift $\pi + \phi$ between the EMG
term in $V_2(t)$ relative to that in $V_1(t)$, then with reference
to the cuff model in Fig. 1 and for $Z_{t1} > Z_{t2}$, $V_1(t)$ and
$V_2(t)$ are given by

$$V_1(t) = A\left[\frac{(1 + X_{imb})}{2} V_{EMG}\sin(\omega_1 t) + V_{ENG}\sin(\omega_2 t)\right] \qquad (3)$$

$$V_2(t) = A\left[-\frac{(1 - X_{imb})}{2} V_{EMG}\sin(\omega_1 t + \phi) + V_{ENG}\sin(\omega_2 t)\right] \qquad (4)$$


where Vang and Veme are the voltage amplitudes of upnc(t) 
and ipma(t)[Zro(Ze1 + Z12)/(Zto + Ze1 + Z2)] in Fig. 1, re- 
spectively, and w; and w» their respective frequencies. Further- 
more, assuming that J1(¢) and Io(t) in Fig. 3 have settled to 
their final values for a given Xjmp, such that 


Giat = Gmoll Te Ximb) 
Gm2 = Gel a Ximb) (5) 


where G',,. is the mean gain of each variable-gain OTA, the AT 
output is given by 


(1 a Xiap) 
3 


Vat AG Te Ge Veme[sin(wit)—sin(wit+¢)] 





+2Venc oo) (6) 


which using standard trigonometric identities modifies to 


(1 — Xiab) 
2 


Walt AG a Go Veme Vy 2 — 2cos(¢) 





x cos(wit — 0) + 2Vene sno) (7) 


where $\theta = \tan^{-1}[(\cos(\phi) - 1)/\sin(\phi)]$ is the phase shift of
the residual output EMG relative to the input EMG (i.e., seen at
the electrodes). Thus, if $\phi = 0$, the AT will (ideally) eliminate
EMG. However, if $\phi \neq 0$, the amplitude of the residual output
EMG will depend on $\phi$. From (7), the output SIR can be defined
as

$$SIR_{out} = \frac{4V_{ENG}}{(1 - X_{imb}^2)\,V_{EMG}\sqrt{2 - 2\cos(\phi)}} \qquad (8)$$

Thus, $\phi$ in radians can be calculated by

$$\phi = \cos^{-1}\left[1 - 8\left(\frac{SIR_{in}/SIR_{out}}{1 - X_{imb}^2}\right)^2\right] \qquad (9)$$

where $SIR_{in} = V_{ENG}/V_{EMG}$. For example, if $SIR_{out} = 1$,
$SIR_{in} = 1/500$, and $X_{imb} = \pm 40\%$, then $\phi = \pm 0.55°$.
This can be converted to an error term $\varepsilon$ for the maximum
tolerable component mismatch of two first-order RC high-pass
filters. Since the ENG does not exhibit very low-frequency
components, a low-end ENG amplifier bandwidth of 100 Hz is
usually realized [6]. The worst case $\phi$ for a cut-off frequency
of 100 Hz is when the EMG frequency is also 100 Hz, giving
$\varepsilon = 1.89\%$ between the two RC product values. It should be
noted that although a mismatch between the −3-dB frequencies
of the two filters will also introduce magnitude errors, these
will be seen by the control stage of the AT as cuff imbalance
and corrected.

²In an implantable ENG amplifier, ac coupling realized by RC high-pass
filters is also included in series with the cuff electrodes to prevent dc currents
flowing through the tissue, which would cause electrolysis, and to cancel dc
offsets stemming from the electrodes [15]. However, since passive components are
usually used for such filters, their cut-off frequency can be made extremely low,
thereby minimizing the possibility of phase shifts.
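The worked example can be reproduced numerically. The following Python sketch evaluates (9) for the quoted values and then converts the phase error into an RC mismatch. It assumes two ideal first-order high-pass filters (phase lead atan(f_c/f)), with one filter nominal and the other having its RC product reduced by a fraction eps; the function and variable names are illustrative and are not taken from the original design.

```python
import math

def max_phase_error(sir_in, sir_out, x_imb):
    """Phase error phi (radians) from (9), i.e. (8) solved for phi."""
    ratio = (sir_in / sir_out) / (1.0 - x_imb ** 2)
    return math.acos(1.0 - 8.0 * ratio ** 2)

def rc_mismatch_for_phase(phi, f_signal, f_cutoff):
    """Fractional RC reduction eps that yields a phase difference phi.

    Assumes ideal first-order high-pass filters with phase lead
    atan(f_c / f). Reducing RC by eps raises f_c to f_c / (1 - eps).
    """
    nominal_phase = math.atan(f_cutoff / f_signal)
    shifted_ratio = math.tan(nominal_phase + phi)   # equals f_c' / f
    return 1.0 - (f_cutoff / f_signal) / shifted_ratio

phi = max_phase_error(sir_in=1 / 500, sir_out=1.0, x_imb=0.4)
eps = rc_mismatch_for_phase(phi, f_signal=100.0, f_cutoff=100.0)
print(f"phi = {math.degrees(phi):.2f} deg, eps = {100 * eps:.2f} %")
```

Under these assumptions the sketch gives a phase error of about 0.55° and an RC mismatch close to 1.9%, in line with the figures quoted in the text.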


IV. CIRCUIT DESIGN 
A. Low-Noise Preamplifiers 


The preamplifiers, being the front-end interface with the cuff 
electrodes, are required to exhibit very low-noise performance 
and have reasonable voltage gain (about 40 dB), so that low 
noise is not a concern for the design of the subsequent system 
stages. The exact gain of the preamplifiers is not important be- 
cause any gain mismatch between them will be compensated for 
by the control stage of the AT. Thus, a simple feedforward archi- 
tecture was employed as depicted in Fig. 4, thereby avoiding the 
complexity and noise of feedback networks. Noise optimization 
of the preamplifiers was explicitly described in [16], where it 
was shown that in order to achieve the required noise specifica- 
tion with minimum die area and power dissipation, the input dif- 
ferential pair transistors Q1 and Q2 in Fig. 4 should be bipolar. 
Because of this requirement, the complete adaptive ENG am- 
plifier was implemented in BiCMOS technology, although the 
control stage utilizes MOS transistors only. 

Fig. 4. Preamplifier circuit.

The preamplifier circuit in Fig. 4 consists of a simple
BiCMOS OTA (Q1, Q2, M1, and M2) terminated in the load
resistor $R_1$ (40 kΩ; $V_{ref}$ is a dc voltage source of 0.75 V),
followed by a first-order bandpass filter, which restricts the
bandwidth to about 100 Hz to 10 kHz. The upper cut-off
frequency is obtained by the combination of resistor $R_2$ (500 kΩ)
and capacitor $C_1$ (27 pF), while the lower cut-off frequency is
obtained by capacitor $C_2$ (80 pF) with the series combination
of transistors M6 and M7, the latter transistor pair forming a
high-value (20 MΩ) grounded linear active resistor. In addition
to eliminating low frequencies below the ENG passband, the
high-pass section of the bandpass filter also removes some of
the low-frequency flicker (1/f) noise voltage tail and ensures a
dc offset-free preamplifier output. The ac coupling mechanism
is very important since the succeeding variable-gain OTAs are
driven single-ended, and thus, the presence of dc offset voltages
(>1 mV) at their inputs would severely degrade the output SIR.
By appropriate scaling of the aspect ratios of M6 and M7, a
high-value resistance is obtained with a maximum nonlinearity of
0.25% for a signal swing of ±85 mV. The dc bias voltages of
M6 and M7 are provided by the diode-connected transistors M8
and M9, respectively, which are in turn biased by the dc current
sources $I_{b2}$ and $I_{b3}$.
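The quoted passband can be cross-checked from the component values. The sketch below assumes that each corner is set by a single ideal RC product, f = 1/(2πRC), and neglects any loading between stages; the helper name is hypothetical.

```python
import math

def corner_frequency(resistance_ohm, capacitance_farad):
    """-3 dB frequency of an ideal first-order RC section."""
    return 1.0 / (2.0 * math.pi * resistance_ohm * capacitance_farad)

# Values quoted in the text for the preamplifier band-pass filter.
f_upper = corner_frequency(500e3, 27e-12)   # 500 kOhm resistor, 27 pF capacitor
f_lower = corner_frequency(20e6, 80e-12)    # 20 MOhm active resistor, 80 pF capacitor
print(f"lower corner = {f_lower:.1f} Hz, upper corner = {f_upper / 1e3:.2f} kHz")
```

With these values the lower corner comes out just under 100 Hz and the upper corner near 12 kHz, consistent with the stated bandwidth of about 100 Hz to 10 kHz.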

As the base current of Q1 and Q2 cannot be supplied by the 
input interface, this was generated on-chip as shown in Fig. 4. 
Essentially, Q3 generates a replica of the base currents of Q1 and
Q2, which is fed into the pMOS current mirror M3-M5 whose
outputs feed the bases of Q1 and Q2, respectively. The base of
Q4 is connected to ground to ensure that the emitter voltage of
Q3 is at the appropriate level. Furthermore, the collector of Q3
is connected to $V_{ref}$ to mimic as far as possible the dc
conditions of Q1 and Q2 (the residual input dc base current is about
30 nA). The areas of M4 and M5 were carefully chosen so that for
an 800-nA drain current, their noise contribution is negligible.
The bias currents for the OTA and the base current reduction
circuits are provided by the dc current sources $I_{b1}$. The value of
$I_{b1}$ was appropriately selected so that the input-referred r.m.s.
noise voltage of the preamplifier is 290 nV (noise bandwidth of
1 Hz to 15 kHz). Both preamplifiers share the same current
reduction and biasing circuits.

It should be noted that the preamplifiers could also be realized
in CMOS technology by using the available parasitic lateral
bipolar transistors. However, due to the poor matching of such 
devices, a larger die area and greater power dissipation would 
be required to meet the noise specification. 


B. Variable-Gain OTAs 


Fig. 5. Variable-gain OTA circuit.

The composite signal at the input to each AT channel consists
of EMG and ENG components with nominal peak-peak swing
after preamplification of around ±50 mV (for $X_{imb} = 0$) and
±100 µV, respectively. The control stage is required to have
sufficient gain to amplify the ENG to a reasonable amplitude (i.e.,
±20 mV) and also sufficient linearity to accommodate the EMG
signal. The decision to use an OTA to implement each
variable-gain stage was based on the following two reasons: 1) using
an OTA, variable-gain capability can be very simply achieved by
changing its tail current, and 2) the output current signal from an
OTA simplifies the design of the subsequent full-wave rectifiers
and current comparator circuits. The basic requirement is that
each variable-gain OTA must have enough linear gain range to
allow even for extreme $X_{imb} = \pm 40\%$ as suggested in [13].
Although the nominal signal swing after preamplification
with $X_{imb} = \pm 40\%$ is expected to be about ±70 mV, the linear
input range of each variable-gain OTA was set to ±85 mV
to allow for some variation in the nominal EMG amplitude
picked up from the cuff electrodes. The variable-gain OTA was
designed for operation in strong inversion and its simplified
schematic is shown in Fig. 5. The circuit essentially consists of
a symmetrical simple CMOS OTA (input transistors M1 and
M2) with current mirrors M3-M10 of unity current ratio which
in practice were regulated cascodes [17]. The gain of the OTA
is controlled by the feedback current $I_{f1}$, and the circuit has two
current outputs, $I_{o1}$ and $I_{o2}$, each connecting to the input of a
full-wave rectifier or to the input of the output-stage amplifier.
Fig. 6. Two full-wave rectifier circuits.


Assuming matched transistors and neglecting channel length
modulation, each output current of the OTA in Fig. 5 is given by
[18]

$$I_o = \sqrt{2kI_{f1}}\,V_i\sqrt{1 - \frac{kV_i^2}{2I_{f1}}}, \qquad |V_i| \le \sqrt{\frac{I_{f1}}{k}} \qquad (10)$$

where $V_i$ is the input voltage, $k = \mu C_{ox}(W/2L)$ is the
transconductance parameter, W and L are the channel width and length
of the input transistors, $\mu$ is the carrier mobility, and $C_{ox}$ is
the gate oxide capacitance per unit area. The relationship
between transconductance $G_m$ and $V_i$ can be obtained by taking
the derivative of (10) with respect to $V_i$, yielding

$$G_m = \frac{\partial I_o}{\partial V_i} = \sqrt{2kI_{f1}}\,\frac{1 - (kV_i^2/I_{f1})}{\sqrt{1 - (kV_i^2/2I_{f1})}} \qquad (11)$$

For $V_i \ll \sqrt{I_{f1}/2k}$, the OTA transconductance simplifies to

$$G_m = g_{m1,2} = \sqrt{2kI_{f1}} = \sqrt{2kI_{fo}}\,(1 - X_{imb}) = G_{mo}(1 - X_{imb}) \qquad (12)$$

where $g_{m1,2}$ is the small-signal transconductance of transistors
M1 and M2 in Fig. 5 and $I_{fo}$ is the mean (dc) value of $I_{f1}$.
Furthermore, in order to maintain less than 1% nonlinearity, it
is required that

$$|V_i| < 0.2\sqrt{\frac{I_{f1}}{k}} \qquad (13)$$

Given the nature of the signals after preamplification as
discussed and aiming for an output-stage transimpedance gain of
about 500 kΩ, a mean value for $G_m$ of 185 µA/V was chosen.
Thus, for $V_i = \pm 85$ mV, (12) and (13) can be solved for suitable
values of k and $I_{f1}$.
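As an illustration of how (12) and (13) constrain the design, the sketch below solves them with equality for the transconductance parameter k and the tail current. It assumes the forms $G_m = \sqrt{2kI_f}$ and $|V_i| = 0.2\sqrt{I_f/k}$ evaluated at the mean tail current. The text does not state the values actually chosen, so the results are only indicative boundary values, and the function name is hypothetical.

```python
import math

def size_ota(gm_target, v_in_max, linearity_factor=0.2):
    """Solve (12) and (13), taken with equality, for k and the tail current.

    (12): gm = sqrt(2 * k * i_f)
    (13): v_in_max = linearity_factor * sqrt(i_f / k)
    """
    v_scale = v_in_max / linearity_factor        # equals sqrt(i_f / k)
    k = gm_target / (math.sqrt(2.0) * v_scale)   # since gm = sqrt(2) * k * v_scale
    i_f = k * v_scale ** 2
    return k, i_f

k, i_f = size_ota(gm_target=185e-6, v_in_max=85e-3)
print(f"k = {k * 1e6:.0f} uA/V^2, I_f = {i_f * 1e6:.1f} uA")
```

A practical design would also have to satisfy the linearity condition at the lowest tail current reached during imbalance correction, which this sketch does not model.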
Fig. 7. Comparator circuit.


C. Full-Wave Current Rectifiers 


The two full-wave rectifiers shown in Fig. 3 were realized
by the current-mode circuit in Fig. 6. The upper rectifier (M1,
M2, M5, M6) operates on current $I_{i1}$ stemming from OTA $G_{m1}$,
while the lower rectifier (M3, M4, M7, M8) operates on current
$I_{i2}$ stemming from OTA $G_{m2}$. The core of each current rectifier
is a pair of complementary transistors, M1, M2 (upper rectifier) and
M3, M4 (lower rectifier), each transistor performing half-wave
precision current rectification [19]. During positive excursions
of $I_{i1}$ and $I_{i2}$, M1 and M4 are turned on and M2 and M3 are
turned off. Thus, the drain currents of M1 and M4 equal $I_{i1}$ and
$I_{i2}$, respectively, while those of M2 and M3 are zero. During
negative excursions of $I_{i1}$ and $I_{i2}$, M2 and M3 are turned on and
M1 and M4 are turned off. In this mode the drain currents of
M2 and M3 equal $I_{i1}$ and $I_{i2}$, respectively, while those of M1
and M4 are zero. For the upper rectifier, full-wave rectification
is obtained by mirroring the drain current of M2 through the
unity-gain pMOS current mirror M5, M6 and adding the mirror
output to the drain current of M1. Similarly for the lower
rectifier, full-wave rectification is obtained by mirroring the drain
current of M4 through the unity-gain nMOS current mirror M7,
M8 and adding the mirror output to the drain current of M3.
In practice both current mirrors were realized by regulated
cascodes [17]. The addition of the various drain currents is done at
the input node of the current comparator, resulting in the output
current $I_c = |I_{i1}| - |I_{i2}|$ as indicated in Fig. 6. Although a
considerable voltage drop of about 2 V is generated at the input node
of each rectifier, the use of regulated cascode mirrors with long
transistors in the variable-gain OTAs ensures that $I_{i1}$ and $I_{i2}$
are not degraded by channel length modulation.


D. Current Comparator 


The output currents from the two full-wave rectifiers are 
summed at the input of the current comparator circuit [20] 
shown in Fig. 7 to form current $I_c$. The comparator uses a
CMOS inverter (M3, M4) to apply negative feedback around a
class-B voltage buffer (M1, M2). As a result of the feedback,
the comparator input has a low impedance (in general) and
is thus ideal for determining the polarity of $I_c$. On the other
hand, the output of the inverter does not swing between the 
power supplies and so some static power dissipation is present. 
Fortunately, since in this application low-speed operation is 
required, the inverter transistors can be scaled to minimize 
power dissipation. The buffer transistors have zero dc power 
dissipation. 
Fig. 8. Large time-constant integrator circuit.


E. Large Time-Constant Integrator 


Because of the nature of cuff imbalance variations as dis- 
cussed in Section II, the integrator time-constant should be as 
large as possible. System level simulations have shown that a 
time-constant of about 1 s is required for this application. The 
integrator schematic is shown in Fig. 8. The circuit comprises 
three stages. The first stage, consisting of the simple CMOS 
OTA (M1-M4) terminated in resistor $R_1$ (2 kΩ), is essentially
an attenuator which also corrects amplitude variations between
the comparator peak-positive and peak-negative output
voltages. This is very important since significant comparator
output offsets would affect the settling time and $SIR_{out}$ of
the AT differently for positive and negative $X_{imb}$ values. The
second stage is the actual OTA-C integrator (operated in weak
inversion), and this consists of a CMOS OTA (M5-M11) utilizing
transconductance cancellation [21], and an integrating
capacitor $C_1$ (47.5 pF) which is connected across the low and
high impedance nodes x and y, respectively. The attenuation
provided by the first stage ensures that the input voltage to the
second-stage OTA is within its linear range of operation. The
second-stage OTA is biased to achieve a transconductance $G_m$
of 6.9 nA/V given by $g_{m6,8} \times [(n - 1)/(n + 1)]$, where $g_{m6,8}$
is the small-signal transconductance of M6 and M8, and n is
the ratio of the transconductance of M6 to M5 (or M8 to M7).
Transistor M11 performs level-shifting of the output voltage 
for interfacing with the third stage. 

Fig. 9. Output-stage amplifier circuit.

The third stage (M12-M17) is another transconductance
stage converting the voltage across $C_1$ to the differential
feedback currents $I_{f1}$ and $I_{f2}$. The tail currents of the three
integrator stages are provided by the dc current sources $I_{b4}$, $I_{b5}$ and
$2I_{b6}$. The OTA-C stage, being lossy, has the following transfer
function:

$$T(s) = \frac{G_m}{s\,2C_1 + g_o} \qquad (14)$$

where s is the Laplace operator, and $g_o$ is the small-signal output
conductance seen into node y. The integrator time-constant is
$2C_1/g_o$ and $g_o$ is set by $I_{b5}$. Any possible dc offset voltages
across nodes x and y resulting from transistor mismatches may
be externally corrected by adjusting the dc voltage level of $R_2$
(e.g., by means of current injection). The dc voltage source $V_{ref}$
is the same as in Fig. 4 and the simulated dc gain $G_m/g_o$ of the
OTA-C stage is about 73. The integrator described here offers a
simpler implementation than the design in [22], both addressing
the same application.
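The integrator figures quoted above can be tied together with a short calculation. The sketch below assumes the lossy-integrator form $T(s) = G_m/(s\,2C_1 + g_o)$, so that the dc gain is $G_m/g_o$ and the time constant is $2C_1/g_o$; the function name is illustrative.

```python
def integrator_time_constant(gm, dc_gain, c_int):
    """Return (g_o, tau) for T(s) = gm / (s * 2 * C + g_o)."""
    g_o = gm / dc_gain          # dc gain of the lossy integrator is gm / g_o
    tau = 2.0 * c_int / g_o     # pole at s = -g_o / (2 * C)
    return g_o, tau

g_o, tau = integrator_time_constant(gm=6.9e-9, dc_gain=73.0, c_int=47.5e-12)
print(f"g_o = {g_o * 1e12:.1f} pS, tau = {tau:.2f} s")
```

With the quoted transconductance, dc gain and capacitance, the time constant comes out close to 1 s, matching the requirement stated at the start of this subsection.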


F. Output-Stage Amplifier 


The schematic of the output stage amplifier is shown in Fig. 9.
The second output branch from each variable-gain OTA (Fig. 5)
is hardwired to resistor $R_1$ (50 kΩ) where the two composite
currents $I_{i1}$ and $I_{i2}$ are summed to form $I_i$. Due to the
corrective action of the control stage, the two EMG components in $I_{i1}$
and $I_{i2}$ are ideally of the same amplitude and, being anti-phase,
cancel when added. The ENG components in $I_{i1}$ and $I_{i2}$, on the
other hand, are in-phase, so when added they generate a voltage
across $R_1$ which is further amplified. The amplifier (M1-M9) in
Fig. 9 is a standard two-stage op-amp configured as a noninverting
amplifier through the feedback resistive network $R_2$ (90 kΩ) and
$R_3$ (10 kΩ). The amplifier employs zero-pole compensation realized
by the series combination of transistor M8 and capacitor $C_1$
(3.5 pF), and the circuit is biased by the dc current source $I_{b7}$.
The simulated open-loop gain of the op-amp is 106 dB, and the
input-referred r.m.s. noise current of the complete transimpedance
stage is about 100 pA (bandwidth of 1 Hz to 15 kHz).

Fig. 10. Chip microphotograph.

TABLE I
MOS TRANSISTOR DIMENSIONS

Circuit                      Transistor label   W/L (µm/µm)
Preamplifier (Fig. 4)        M1, M2             150/10
                             M3-M5              20/10
                             M6                 2/519
                             M7                 2/371
                             M8                 7/7
                             M9                 5/8
Variable-gain OTA (Fig. 5)   M1, M2             150/30
Rectifiers (Fig. 6)          M1, M4             200/0.8
                             M2, M3             80/0.8
Comparator (Fig. 7)          M1, M4             (illegible in source)
                             M2, M3             2/2
Integrator (Fig. 8)          M5, M7             10/10
                             M6, M8             10/9
                             M9, M10            25/50
                             M11                2/30
                             M12, M13           200/5


V. MEASURED RESULTS 


The adaptive ENG amplifier chip, shown in Fig. 10, was
fabricated in the austriamicrosystems 0.8-µm BiCMOS process [23]
which includes a high resistive layer. A second chip containing
the control stage configured as test structures was also
fabricated. The substrates of all transistors were connected to their
respective power supply rail (i.e., nMOS to $V_{SS}$ and pMOS to
$V_{DD}$), and the dc bias current sources $I_{b1}$ (150 µA), $I_{b2}$ (10 µA),
$I_{b3}$ (10 µA), $I_{b4}$ (2 µA), $I_{b5}$ (10 nA), $2I_{b6}$ (200 µA), and $I_{b7}$
(50 µA), in Figs. 4, 8, and 9, were realized by on-chip
biasing circuitry (not described). Some of the key MOS transistor
dimensions are listed in Table I. In total, 40 chips were
fabricated (20 test structures and 20 complete systems); all showed
correct operation.

Fig. 12. Frequency spectrum of the composite input signal. The spectrum of
the band-limited white noise signal representing the EMG resembles that of the
real EMG.

The input ac signals to the AT chip (DUT) were provided by 
two audio transformers $T_1$ and $T_2$ (A262A7E) as illustrated in
Fig. 11. The ac voltage sources, $v_{EMG}(t)$ and $v_{ENG}(t)$, generate
the EMG and ENG signals, respectively, resistors $R_1$, $R_2$, $R_3$,
$R_4$, $R_5$, and $R_X$ provide attenuation, and the variable resistor
$R_X$ also generates amplitude imbalance (modeling $X_{imb}$)
between the EMG terms of the two composite signals across nodes
ab and cb. Furthermore, resistors $R_{e1,2,3}$ represent the electrode
resistances. Initially, the chips were tested with sinusoidal
signals, $v_{EMG}(t)$ (100 Hz) and $v_{ENG}(t)$ (1 kHz), with nominal
peak amplitudes across nodes ab (and bc) in Fig. 11 of $V_{EMG}$ =
0.5 mV and $V_{ENG}$ = 1 µV, respectively. Subsequently, in order
to model a more realistic test, $v_{EMG}(t)$ was replaced by an
arbitrary signal (generated from band-limited Gaussian noise)
with the frequency spectrum plotted in Fig. 12 (measured across
ac). The frequency content of this signal varies between 1 Hz
and 3 kHz, with a peak at approximately 250 Hz, which is the
case with the real EMG signal [6]. The $v_{ENG}(t)$ was kept in
all measurements as a sinusoid with the characteristics
mentioned above. In Fig. 12, the ENG magnitude (−114 dB) is
buried under the spectrum floor of the random EMG signal. 
Fig. 13. System output for +20% imbalance. (a) Time-domain.
(b) Frequency-domain.


The time-domain tests were monitored on an Agilent 54835A
Infiniium oscilloscope, and the frequency-domain tests on a
Stanford Research Systems SR760 FFT spectrum analyzer.
Figs. 13 and 14 show the time-domain and frequency-domain
outputs of the AT (after settling) for +20% and −40%
imbalance, respectively. The spectra show that the $SIR_{out}$ is better
than 3 (9.54 dB) even for 40% imbalance. This should be
compared with a $SIR_{in}$ of 1/500 (−54 dB). These results show the
superiority of the AT relative to any filtering technique because
its operation is not frequency related. The average $SIR_{out}$ for all
20 (complete) AT chips as a function of imbalance is plotted in
Fig. 15(a) (Matlab best linear-fit), where it can be seen that even
for extreme values of imbalance, the mean AT $SIR_{out}$ is better
than 2 (6 dB). The error bars in the plot indicate the spread of
values from all 20 chips. For comparison, Fig. 15(b) shows the
mean $SIR_{out}$ improvement over the theoretical TT and QT
amplifier configurations as a function of imbalance (for the TT the
input amplifiers were assumed to be matched, and for the QT
the electrode impedance values listed in the caption of Fig. 1
were assumed). From the plot, it is apparent that the AT
significantly outperforms both counterparts in the presence of
imbalance. Fig. 16 shows the settling time of the feedback current
$I_{f1}(t)$ in Fig. 3 for abrupt step-like changes in imbalance. The
imbalance was changed successively between +32.5%, −5.5%,
−25%, and −34%. The corresponding settling time (to 1%) is
about 20 ms per percent change in $X_{imb}$.

Fig. 15. (a) Mean AT $SIR_{out}$ versus (absolute) imbalance for all 20 chips.
(b) $SIR_{out}$ improvement over the ideal TT and QT counterparts versus
(absolute) imbalance.

Fig. 16. Settling time of feedback current $I_{f1}(t)$ for abrupt changes in
imbalance.

Finally, in order to test the sensitivity of the AT architecture to 
phase variations, phase shifts were introduced between the two 
input EMG terms to the system (the additional test structure chip 
was used for this test). Fig. 17 shows the SIRout as a function 
of phase shift for both measured and theoretical cases, the latter 
calculated from (9) and for 40% imbalance. The two graphs 
show excellent agreement, but for phase values near the origin, 
the theoretical SIRout tends to infinity, which would never be 
the case for a practical realization. Saline-bath testing of the AT 
chip (not described here) also confirmed its high performance. 
The main design features of the AT chip are summarized in 
Table II. 
Fig. 17. Sensitivity of AT $SIR_{out}$ to phase shifts.


TABLE II
SUMMARY OF PERFORMANCE

Parameter                          Value
Technology                         0.8 µm BiCMOS
Power supply                       (illegible in source)
Power consumption                  7.2 mW
Active area (core)                 0.68 mm²
SIR_out                            > 6 dB
Imbalance correction range         > ±40%
Total ENG path gain                87 dB
Settling time (step-change)
  ±20% imbalance                   480 ms
  ±40% imbalance                   960 ms


VI. CONCLUSION 


The design of an adaptive ENG amplifier for interface 
to tripolar cuff electrodes has been described. The adaptive 
ENG amplifier offers a fully implantable solution to the 
problem of cuff imbalance, thereby significantly advancing 
the state-of-the-art in the field. The described realization over- 
comes many of the limitations of a previous design in terms 
of reliability, cuff imbalance correction range, output SIR and 
output signal distortion. The operation of the circuit has been 
thoroughly verified by tests on 40 fabricated chip samples, 
all exhibiting correct behavior. Although the described adap- 
tive ENG amplifier has been developed for a next-generation 
bladder implant, it can also be seen as a generic high-perfor- 
mance ENG amplifier for any functional electrical stimulation 
application employing tripolar nerve cuff electrodes. 
Noise-Shaping Techniques Applied to
Switched-Capacitor Voltage Regulators 


Arun Rao, William McIntyre, Member, IEEE, Un-Ku Moon, Senior Member, IEEE, and 
Gabor C. Temes, Life Fellow, IEEE 


Abstract—A delta-sigma control loop for a buck-boost dc-dc
converter with fractional gains is presented. This technique reduces
the tones caused by the traditional pulse-frequency modulation
regulation. The prototype regulator was fabricated in a 0.72-µm
CMOS process and clocked at 1 MHz. It achieved suppression of 
tones up to 55 dB in the 0-500-kHz range. The input voltage range 
was 3-5 V. The output voltage ranged from 1.8 to 4 V for load cur- 
rents up to 150 mA. 


Index Terms—Boost, buck, dc—dc converter, delta-sigma, noise 
shaping, voltage regulators. 


I. INTRODUCTION 


SMALL electronic devices are commonly powered by batteries,
which allow them to be portable. However, as battery
use continues, the battery voltage drops, sometimes gradually 
and sometimes suddenly, depending on the type of battery and 
type of electronic device. Such variations in the battery voltage 
may have undesirable effects on the operation of the device pow- 
ered by the battery. Also, the battery voltage may not be optimal 
for the device. Consequently, dc—dc converters are used to pro- 
vide a stable output supply voltage of suitable magnitude from 
the battery to the electronic device. 

For many years, the inductive conversion topology has been 
the standard way to provide a stable voltage from a battery. With 
the continued shrinking of handheld devices such as cell phones, 
PDAs, pagers and laptops, the use of inductive regulators is be- 
coming less attractive. A compact switched-capacitor (SC) reg- 
ulator is preferable to the bulky inductive regulator. SC power 
conversion offers reduced physical volume, less radiated EMI, 
as well as efficiency and cost advantages over inductive based 
structures. A fixed gain SC dc-dc boost converter may have a
gain greater than or equal to one, while a fixed gain SC dc-dc 
buck converter may have a gain less than or equal to one. 

In addition to increasing or decreasing the battery voltage, 
voltage regulation is required to maintain the battery voltage 
at a constant desired value. A conventional method to regulate 
voltage in a SC converter is to use pulse-frequency modulation 
(PFM) or burst-mode operation. These control techniques suffer 
from tones in the frequency spectrum. The tones are difficult to
filter out, as their frequencies vary with load and input voltage.
As a result, circuits that use the regulated voltage are susceptible
to tones in the frequency region of operation. Furthermore, these 
tones can mix with unwanted signals outside the band of interest 
and modulate into the desired signal band. 

In this paper, an alternate control technique using a 
delta-sigma loop is presented [1], which spreads the tones 
of the conventional SC regulator. The charge pump used to 
convert the input voltage acts as a D/A converter in the loop, 
and its output ripple is frequency shaped by the delta-sigma 
control loop, which also provides the pulse-frequency mod- 
ulation needed for the conversion. We have applied the new 
control loop architecture successfully to an existing buck-boost 
fractional-gain regulator [2]. We could potentially inject a 
long pseudo-random sequence into the existing PFM loop but 
we then have no control over the PFM part of it. We cannot 
randomly make the regulator “skip” or “pump” based on a 
pseudo-random sequence. We would need some information of 
the output and input (for gain selection between the 7 different 
switch capacitor gains), and that will then introduce tones 
as it will be similar to the PFM type architecture. Using the 
delta-sigma control makes it possible to incorporate the gain 
selection into the control loop, thus providing noise shaping 
along with PFM control in a very small area. The measured 
results indicated that the tones generated by the burst-mode 
regulation circuitry can be reduced by as much as 55 dB by 
embedding the dc-dc converter in a delta-sigma loop. This
verified the usefulness of the proposed scheme. It should be 
noted that the tones are reduced by 55 dB with respect to the 
noise floor of the PFM pump. The noise floor of the regulator 
with the delta-sigma control will be higher, because the total 
noise power remains the same as we do not filter the noise 
shaped spectrum (as done in a conventional delta-sigma modu- 
lator). The idea however is to convert the tones to white noise 
and prevent them from modulating into the audio band. The 
experimental results confirm the validity of the method [1]. 


II. FRACTIONAL GAIN SETTING CHARGE PUMP ARCHITECTURE


The block diagram of a widely used burst-mode switched-capacitor
dc-dc voltage regulator [2] is shown in Fig. 1. The
circuit contains two feedback loops. One of them is the PFM loop
which compares the output voltage $V_{REG}$ with the desired output
value $V_{desired}$, and turns the gated clock signal on or off
depending on the result of the comparison. The other loop
performs gain hopping. It sets the gain G to a value that is
sufficiently large to prevent reverse current flow into the battery, but
not too large because then the regulator must drop the voltage by
a large amount, reducing the power efficiency. The gain hopping
loop requires a fractional gain setting circuit, to be discussed
next.

Fig. 1. Burst-mode switched-capacitor dc-dc regulator.

Fig. 2. Switch array with external capacitors.

Fractional gains can be realized by connecting external 
capacitors to an on-chip switch array, as shown in Fig. 2 
[2]. The switch array can provide seven different gains 
G = 1/2, 2/3, 3/4, 1, 4/3, 3/2, and 2. Each gain is
implemented in the two phases of a 1-MHz clock. For example,
Fig. 3 shows the configuration used to implement G = 3/2.

To guarantee that current does not flow into the battery, we
have to ensure that $G > V_{REG}/V_{in}$, where $V_{REG}$ is the desired
output voltage, and $V_{in}$ is the unregulated battery voltage. Also,
to maximize efficiency, G must be as close to $V_{REG}/V_{in}$ as
possible. The gain that satisfies these conditions is defined as
the minimum gain $G_{min}$.

Fig. 3. Capacitor configuration for gain = 3/2 (gain phase and common phase).

Fig. 4. $G_{min}$ versus input voltage $V_{in}$ (for $V_{REG} = 3.3$ V).

When the pump provides the gain $G_{min}$, the largest current
that it can deliver to the load is approximately

$$I_{max} = (G_{min}V_{in} - V_{REG})/R_{out} \qquad (1)$$

where $R_{out}$ is the equivalent output impedance of the switch
array. Each gain configuration has a unique $R_{out}$, which is
a function of the switching frequency, capacitor size and the
switch impedance. Selecting a gain larger than $G_{min}$ increases
$I_{max}$. By increasing the gain only when needed, power is
delivered more efficiently. The gain-hopping loop (Fig. 1) controls
the gain based on a measure of the load current, and sets the
value of $G_{min}$ as a function of $V_{in}$. Fig. 4 illustrates the
minimum gain versus $V_{in}$ for $V_{REG} = 3.3$ V. The gain-hopping
loop consists of an up-down counter, gain-set block, and a
comparator. The up-down counter integrates the pulse sequence at
the comparator output and directs the gain-set block to increase
or decrease the gain.
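The minimum-gain rule and the current limit in (1) can be sketched in a few lines of Python. The sketch assumes the selection criterion is simply the smallest listed gain with $G V_{in} > V_{REG}$, and uses a made-up output impedance of 2 Ω purely for illustration. The actual gain-hopping loop also reacts to load current, which is not modeled here, and all names are hypothetical.

```python
from fractions import Fraction

# The seven gains listed in the text for the switch array.
AVAILABLE_GAINS = [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(1),
                   Fraction(4, 3), Fraction(3, 2), Fraction(2)]

def minimum_gain(v_in, v_reg):
    """Smallest available gain G with G * v_in > v_reg, or None if none exists."""
    for gain in sorted(AVAILABLE_GAINS):
        if gain * v_in > v_reg:
            return gain
    return None

def max_load_current(gain, v_in, v_reg, r_out):
    """Approximate maximum load current from (1): (G * V_in - V_REG) / R_out."""
    return (float(gain) * v_in - v_reg) / r_out

for v_in in (3.0, 4.0, 5.0):
    g_min = minimum_gain(v_in, v_reg=3.3)
    # r_out = 2.0 ohm is an assumed placeholder, not a value from the paper.
    print(v_in, g_min, max_load_current(g_min, v_in, 3.3, r_out=2.0))
```

For a 3.3-V output this selects 4/3 at a 3-V input, 1 at 4 V, and 2/3 at 5 V, which illustrates why the gain must step down as the battery voltage rises.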

The PFM loop in Fig. 1 contains a voltage reference $V_{desired}$,
an analog comparator, and an oscillator. When $V_{REG}$ is below
the voltage reference, the switch array delivers current to the
load. Alternately, when $V_{REG}$ is above the reference, the switch
array rests. By controlling the switching, the output impedance
is modulated to provide the regulation. Also, for a given gain
configuration, the pulse density of the comparator is
proportional to $I_{LOAD}$. If $I_{LOAD}$ is constant, the duty cycle of the
output is fixed, resulting in a highly tonal frequency spectrum.
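The link between a constant load and a tonal spectrum can be seen in a deliberately idealised model of the PFM loop. The sketch below works in normalised charge units and assumes that each enabled clock cycle adds a fixed packet of charge while the load removes a fixed amount per cycle. It is an abstraction for illustration, not a model of the actual switch array, and the names are hypothetical.

```python
def pfm_pulse_train(load_per_cycle, charge_per_pump, n_cycles=1000):
    """Simulate an idealised PFM loop in normalised charge units.

    The output error starts at zero. On each clock cycle the comparator
    enables the pump if the output is below the reference, and the load
    then removes a fixed amount of charge.
    """
    error = 0.0            # output voltage minus reference, normalised
    pulses = []
    for _ in range(n_cycles):
        pump = error < 0.0
        if pump:
            error += charge_per_pump
        error -= load_per_cycle
        pulses.append(1 if pump else 0)
    return pulses

pulses = pfm_pulse_train(load_per_cycle=0.25, charge_per_pump=1.0)
density = sum(pulses) / len(pulses)
print(density, pulses[:9])
```

In this model the pulse density equals the ratio of load charge to pump charge, and with a constant load the pump pattern repeats exactly every four cycles. A strictly periodic pattern of this kind concentrates the ripple energy at discrete frequencies, which is the tonal behaviour described above.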


III. MODELLING THE SWITCHED-CAPACITOR REGULATOR


In order to simulate the regulator at the system level,
closed-form expressions must be found for each of the gain
configurations. That helps to predict the time-domain behavior of the reg-
ulator to a first-order approximation without simulating any real 
circuit components. The expressions that follow are all based 
on the assumption that the switches have zero on-resistance 
Ron. The output impedance of the regulator is a function of 
Ron, Cext, and f (switching frequency). The assumption of Ron 
to be zero in the closed form expression predicts lower output 
impedance for the pump. This is similar to using a larger value 
of C.x4 on the actual regulator. 

A typical time-domain output of a given gain configuration
($G = 1/2$) is shown in Fig. 5. The two phases are $\Phi_1$
(gain phase) and $\Phi_2$ (common phase). The four voltages
at the boundaries of the two phases, marked in Fig. 5, are of
importance. Since a constant load $I_{load}$ was assumed, these
four values repeat after every cycle in the steady state. By
applying conservation of charge, one can compute the value of
the output voltage $V_{out}$, sampled at the end of phase $\Phi_2$
[3]:


$$V_{out} = \frac{V_{in}}{2} - \frac{I_{load}\,(C_{hold} + C)}{8fC\,(2C + C_{hold})}\left[\frac{C_{hold} + C}{3C + C_{hold}} + 1\right] - \frac{I_{load}}{2f\,(3C + C_{hold})} \qquad (2)$$

Fig. 5. Ideal time domain response for G = 1/2.

Fig. 6. Block diagram of the first-order ΔΣ control loop.


where $f$ is the switching frequency and $C = C_{ext1,2,3}$, as all
three capacitors are nominally of equal size. Clearly, if $I_{load}$
is zero, the output voltage is $V_{in}/2$, as expected. The above
expression was simulated in MATLAB and compared with SPICE
simulations. They were found to be in agreement.

One can also compute $V_{out}(n)$, the output voltage at the nth
sample [3], for a time-varying input voltage $V_{in}(n)$:

$$V_{out}(n) = a\,V_{out}(n-1) + b\,V_{in}(n) \qquad (3)$$

where

$$a = \frac{(C + C_{hold})^2}{(3C + C_{hold})^2}$$

and

$$b = \frac{C}{3C + C_{hold}}\left[\frac{C_{hold} + C}{3C + C_{hold}} + 1\right]. \qquad (4)$$


This suggests that the charge pump can be modeled as a lossy
integrator with a pole at $a < 1$ and constant gain $b$. It should
be mentioned that this model represents the charge pump in a
single gain setting and does not model the dynamic variations
between the different gain settings. The key idea is to be able
to simulate the regulator to a first-order approximation, and to
predict the time- and frequency-domain responses without
circuit-level simulation.

Fig. 7. Discrete time model of the regulator with the ΔΣ control loop.

Fig. 8. Variation of the NTF with feedforward factor K.
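
As a rough illustration of the lossy-integrator model, the Python sketch below iterates V_out(n) = a·V_out(n−1) + b·V_in(n) for the G = 1/2 setting, with a = ((C + C_hold)/(3C + C_hold))² and b = [C/(3C + C_hold)]·[(C_hold + C)/(3C + C_hold) + 1]. It assumes a constant input and no load current. The capacitor and input values are those quoted later for the simulations (0.33 µF, 30 µF, 5.2 V); the function names are illustrative and do not come from the paper.

```python
def pump_coefficients(c_fly, c_hold):
    """Coefficients a and b of the lossy-integrator model for G = 1/2."""
    d = 3.0 * c_fly + c_hold
    r = (c_fly + c_hold) / d
    a = r * r                      # pole location, always below 1
    b = (c_fly / d) * (r + 1.0)    # input gain
    return a, b


def simulate(v_in, n_cycles, c_fly=0.33e-6, c_hold=30e-6, v0=0.0):
    """Iterate V_out(n) = a*V_out(n-1) + b*V_in with constant V_in, no load."""
    a, b = pump_coefficients(c_fly, c_hold)
    v_out = v0
    trace = []
    for _ in range(n_cycles):
        v_out = a * v_out + b * v_in
        trace.append(v_out)
    return trace


if __name__ == "__main__":
    a, b = pump_coefficients(0.33e-6, 30e-6)
    print("a =", a, "b =", b, "dc gain =", b / (1.0 - a))
    print("settled output:", simulate(5.2, 2000)[-1])
```

The dc gain b/(1 − a) of this recursion equals 1/2 for any capacitor ratio, which agrees with the no-load result V_in/2 stated above.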

The efficiency of the charge pump can also be computed. The
power dissipated at the output, $P_{out}$, can be found, as we know
$V_{out}$ and $I_{load}$. To compute the power $P_{in}$ supplied by the
battery, we need to find the average current delivered by the input
in each of the gain configurations. Then, the efficiency can be
obtained from

$$\eta = \frac{P_{out}}{P_{in}} = \frac{V_{out} I_{out}}{V_{in} I_{in}} \qquad (5)$$


To calculate the average current $I_{in}$ supplied by the input, we
must find the charge supplied by $V_{in}$ in every cycle. Since we
know the value of $V_{out}$ at the beginning and end of each clock
phase [3], we can compute the amount of charge transferred and
calculate the current supplied by $V_{in}$ in every cycle. These
computations do not take into account the nonzero switch resistance
and the power dissipation in the other regulator circuits. The
predicted efficiency given by the closed-form expression will
be close to the actual measured results. However, the closed-form
expression does not include the losses due to switching
of parasitic capacitors associated with the big switches, nor the
switching losses and $I_q$ of the regulator. It is also inaccurate in
the prediction of the efficiency when the regulator is hopping
from one gain to another.

Fig. 9. Time and frequency-domain output plots for the regulator with and
without the ΔΣ control loop.
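
A minimal numerical sketch of the efficiency calculation follows. It evaluates η = (V_out·I_out)/(V_in·I_in) and, as an assumption that is not taken from the paper, models the average input current of a lossless pump as I_in = G·I_out, so that the ideal efficiency reduces to V_out/(G·V_in). Switch resistance, parasitic switching losses, and quiescent current are ignored, as in the closed-form expressions; the function names are illustrative.

```python
def efficiency(v_out, i_out, v_in, i_in):
    """eta = P_out / P_in = (V_out * I_out) / (V_in * I_in)."""
    return (v_out * i_out) / (v_in * i_in)


def ideal_input_current(gain, i_out):
    """Average input current of a lossless pump with voltage gain `gain`.

    Assumption for illustration: charge balance in an ideal pump gives
    I_in = gain * I_out.
    """
    return gain * i_out


if __name__ == "__main__":
    v_in, v_out, i_out = 3.7, 3.2, 0.05
    for gain in (1.0, 4.0 / 3.0, 1.5, 2.0):
        i_in = ideal_input_current(gain, i_out)
        print(gain, round(efficiency(v_out, i_out, v_in, i_in), 3))
```

The printed values fall as the gain rises, which illustrates why selecting the minimum sufficient gain delivers power more efficiently.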


IV. DELTA-SIGMA CONTROL LOOP 


As mentioned earlier, the burst-mode (PFM) control mecha- 
nism leads to a tonal spectrum for the output ripple, which may 
introduce excessive noise into the signal band of the device pow- 
ered by the regulator. The tones may be converted into filtered 
pseudo-random noise by incorporating the complete regulator 
as the feedback DAC into a delta-sigma loop, as shown in Fig. 6. 
We assume that the quantization error e[n] can be modeled as
additive white noise that is independent of the input, is
uniformly distributed in [−Δ/2, Δ/2], where Δ is the step size of
the quantizer, and has a white power spectral density [4]. Then
e[n] can be represented as an additional input to the linearized 
system. The output of the modulator Y(z) can be expressed as 


Y(z) = STF(z)U(z) + NTF(z)E(z) (6) 


where STF(z) is the signal transfer function, and NTF(z) is
the noise transfer function. For the first-order ΔΣ modulator

Fig. 10. Delta-sigma control implementation.


Equation (8) illustrates that if H(z) is a low-pass function with 
a high low-frequency gain, the quantization noise is high-pass 
filtered. 
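
The noise-shaping idea can be illustrated with a textbook single-bit, first-order loop. The Python sketch below is not the regulator's 4-bit loop; it is a generic example, written under that assumption, showing that when the loop filter is an integrator the quantizer output tracks the input on average while the quantization error is pushed toward high frequencies.

```python
def first_order_delta_sigma(u, n):
    """Textbook first-order loop: integrate (input - previous output),
    then quantize the integrator state to +1 or -1."""
    state = 0.0
    y_prev = 0.0
    bits = []
    for _ in range(n):
        state += u - y_prev
        y = 1.0 if state >= 0.0 else -1.0
        bits.append(y)
        y_prev = y
    return bits


if __name__ == "__main__":
    bits = first_order_delta_sigma(0.3, 10000)
    print("mean of bit stream:", sum(bits) / len(bits))
```

Because the integrator state stays bounded, the running average of the bit stream converges to the constant input, which is the time-domain counterpart of a noise transfer function with a zero at dc.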


A. Delta-Sigma Control Loop 


The simplified model of the modified regulator with a delta-sigma
control loop is shown in Fig. 6. The ΔΣ loop provides a
3-bit word necessary for gain selection, plus the 1-bit skip signal
for the PFM operation. The ΔΣ loop contains an integrator and
a 4-bit analog-to-digital converter (ADC). The charge pump acts
as the digital-to-analog converter (DAC) in the loop. The output
of the DAC is the regulated voltage.

The error between the desired voltage and the output voltage
is integrated and fed to the 4-bit ADC. As the output voltage
approaches the desired voltage, the error signal decreases,
reducing the input to the ADC. This causes a smaller gain to be
chosen, until the minimum gain is reached. Since the ΔΣ control
is a first-order loop, dither must be injected to avoid tone
generation [5], [6].

The 3 MSBs from the A/D select one of the seven gain levels, 
and the LSB controls the PFM operation. Since there are seven 
possible gain settings, the 3 bits are sufficient to control all pos- 
sible gains. 


B. Discrete-Time Model of the Delta-Sigma Control Loop 


Fig. 7 illustrates the discrete-time model of the ΔΣ control
loop with the regulator. The delta-sigma loop is a first-order loop
and by itself it is unconditionally stable. As mentioned earlier,
the charge pump can be modeled as a lossy integrator, which
creates an additional pole and may make the loop unstable. In
order to stabilize the loop, a feedforward path was added around
the integrator with a gain K.

Fig. 12. Die photograph of regulator with ΔΣ control loop.

Fig. 13. Measured output ripple and output spectrum with PWM control and
ΔΣ control for $I_{load}$ = 50 mA, $V_{out}$ = 3.2 V and $V_{in}$ = 3.7 V.

Fig. 14. Measured output ripple and output spectrum for PWM control and
ΔΣ control loop for $I_{load}$ = 150 mA, $V_{out}$ = 3.2 V and $V_{in}$ = 3.7 V.

The NTF for the system shown in Fig. 8 is given below:


$$\mathrm{NTF}(z) = \frac{V_{out}}{E} = \frac{b\,(1 - z^{-1})}{1 - z^{-1}\,[1 + a - (K + 1)b] + z^{-2}\,(a - Kb)} \qquad (9)$$

where E is the quantization error of the ADC. This is valid for a
specific value of the input and output voltages and load current,
and assumes that the system is settled. It does not represent the
dynamic behavior of the system, but gives a good estimate of
the stability of the system. We see peaking in the NTF, which
indicates some instability in the loop when the delta-sigma
control is wrapped around the regulator.

The NTF is shown in Fig. 8 for different feedforward gains.
As K increases, the pole Q reduces, making the system more
stable. This can be explained intuitively: the feedforward path
reduces the effect of the delay through the integrator. We have
not been able to derive a closed-form expression for the
stability of the entire system, but MATLAB simulations indicated
that adding a feedforward path reduces the peaking in the NTF, and
that a feedforward factor (K) greater than 4 does not benefit
stability. The time-domain output and the output spectrum of the
regulator with and without the ΔΣ loop are compared in Fig. 9.
Both architectures were simulated using the closed-form
equations [3] (corresponding to the time-domain response of Fig. 5).
For the simulation, $C_{hold}$ was 30 µF, while $C_{ext1,2,3}$ was
0.33 µF and $V_{in}$ was 5.2 V. The simulated curve matches closely
the calculated NTF.
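
The effect of K on the poles can be checked numerically from the NTF denominator 1 − z⁻¹[1 + a − (K + 1)b] + z⁻²(a − Kb). The Python sketch below is an illustration, not the authors' MATLAB script: it assumes the G = 1/2 coefficients a and b of (3) and (4) with the capacitor values quoted above, and it uses the pole radius as a rough proxy for the peaking. The function names are illustrative.

```python
import cmath


def pump_coefficients(c_fly, c_hold):
    """Lossy-integrator coefficients a and b for the G = 1/2 setting."""
    d = 3.0 * c_fly + c_hold
    r = (c_fly + c_hold) / d
    return r * r, (c_fly / d) * (r + 1.0)


def ntf_poles(a, b, k):
    """Roots of z^2 - [1 + a - (k + 1) b] z + (a - k b)."""
    c1 = 1.0 + a - (k + 1.0) * b
    c0 = a - k * b
    root = cmath.sqrt(c1 * c1 - 4.0 * c0)
    return (c1 + root) / 2.0, (c1 - root) / 2.0


if __name__ == "__main__":
    a, b = pump_coefficients(0.33e-6, 30e-6)
    for k in range(5):
        p1, p2 = ntf_poles(a, b, k)
        print("K =", k, "pole radii:", round(abs(p1), 4), round(abs(p2), 4))
```

With these values the poles form a complex pair inside the unit circle, and their radius shrinks as K grows from 0 to 4, in line with the reduced peaking described in the text.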

As Fig. 9 shows, ΔΣ control causes a slightly higher ripple.
This can be attributed to the increased delay in the loop.
However, the spectral properties are very much improved: instead
of high-level tones, the output spectrum contains lower-level,
slightly colored noise, which is much less harmful in most
applications.


V. CIRCUIT IMPLEMENTATION 


Since the ΔΣ loop (Fig. 6) controls only the gain selection,
and is not a part of the signal path, it was kept very simple.
The loop control circuitry is shown in Fig. 10. All the circuitry 
was single-ended since the LSB was large (150 mV). The in- 
tegrator and the gain block were standard switched-capacitor 
stages. The unit capacitance used was 250 fF. A simple two- 
stage Miller-compensated operational amplifier, with an open- 
loop gain of 65 dB, a unity-gain frequency of 17 MHz and a 
phase margin of 55 degrees was used. The ADC/quantizer in 
the delta-sigma control loop was implemented as a conventional 
4-bit flash structure [7]. 

A clocked CMOS comparator was used, as shown in Fig. 11.
The LSB of the ADC is large, so an inverter-based comparator
could be used. The inverters contain current sources to limit
the current flow and hence the power dissipation. A resistor
ladder sets the reference voltage levels. The total resistance of
the ladder is 220 kΩ. The dither circuit is a pseudo-random
number generator using flip-flops and XOR gates. The voltage
reference block consists of a bandgap reference, a D/A converter,
and an E²PROM block. This generates the $V_{desired}$ values
ranging from 3 to 5 V. The E²PROM allows post-package trimming
of the bandgap voltage and $V_{out}$ adjustments through the
DAC.


VI. EXPERIMENTAL RESULTS 


A prototype regulator incorporating the delta-sigma control
loop was implemented in a 0.72-µm CMOS technology. The die
photo is shown in Fig. 12. The active die area is 2.45 × 3.1 mm².
The area of the control loop is 2.45 mm × 0.4 mm. The fabricated
chip was tested through the input range of 3-5 V for several
loads and output voltages. Typical measured output ripple
and spectrum curves for load currents of 150 and 50 mA, an
output voltage of 4.7 V, and an input voltage of 3.4 V are shown in
Figs. 13 and 14. The measurement bandwidth was 500 kHz. We
can see that the PFM control has larger noise spikes at lighter
loads and smaller spikes at heavier loads. This can be attributed
to the fact that the PFM control “skips” less at higher loads. For
this reason, the noise floor of the regulator with delta-sigma
control is higher at light loads than at heavier loads (as the total
noise is not removed).

The efficiencies of the PFM and ΔΣ architectures are plotted
in Fig. 15. With the delta-sigma control loop, the efficiency
curves are smoother than with the PFM control loop. The ΔΣ
control loop selects a lower gain faster than a traditional PFM
control loop. However, once the minimum gain has been chosen,
the efficiencies are comparable for the two architectures.


VII. CONCLUSION 


A pulse-frequency-modulation voltage regulator with a ΔΣ
control loop was designed and fabricated. The test results
indicate that the suppression of noise tones is possible using
this technique. The additional delay through the loop increased
the ripple and caused slightly poorer regulation, but gave much
better spectral behavior.

A 126-µW Cochlear Chip for a Totally
Implantable System


Julius Georgiou, Member, IEEE, and Christopher Toumazou, Fellow, IEEE 


Abstract—In this paper, a single-chip speech processor/stimu- 
lator is presented for use in a totally implanted cochlear prosthesis 
system. It implements a continuous interleaved sampling (CIS) 
strategy. By combining the speech processor and the stimulator 
into one mixed-signal chip, both size and power are reduced 
sufficiently, so as to make a totally implanted system feasible. 
First silicon has been validated and typically operates at 126 µW
(excluding cochlear stimulation currents).


Index Terms—Analog signal processing, cochlear implant, 
micropower, subthreshold. 


I. INTRODUCTION 


The worldwide deaf population exceeds 70 million, of
which approximately 600 000 profoundly deaf individuals
are found in the US and 420000 in the UK. Although conven- 
tional hearing aids provide considerable help for the majority 
of individuals with mild, moderate, or severe hearing loss, these 
aids are of little help where the deafness is profound (average 
loss is greater than about 90 dB SPL in both ears). In such cases, 
an invasive electronic device, i.e., a cochlear implant, has the 
capability to restore hearing to some degree. A cochlear implant 
is used to replace the damaged natural hearing components 
from the eardrum up to the inner hair cells, which transduce 
fluid motion into electrical signals in the nerves. In general, a 
cochlear implant consists of an external speech processor and 
an implanted receiver stimulator; the speech processor picks 
up audio signals and processes these in a suitable manner, 
so as to maximize the benefit for each particular patient. The 
processed signal is then transmitted to the implanted receiver, 
which produces charge-balanced electrical signals to stimulate 
the auditory nerve. This gives a degree of hearing sensation and 
prevents further nerve degeneration [1]. 

Current cochlear speech processors, regardless of manufacturer,
are heavily based upon digital technology running DSP
algorithms on ASIC processors. Although digital technology
has the advantage of being more flexible to modifications
through software, there is a high power penalty to be paid when
the required precision is below 8 bits [2]. As the electrical
dynamic range of patients’ remaining neurons ranges between
3 and 20 dB, using more than 8-bit precision for the signal
processing is massive overkill. With the best state-of-the-art
digital speech processors, batteries need changing every day or
two, and most patients, given the choice, would prefer not to
wear an externally visible processor, although “behind-the-ear”
(BTE) systems have recently reached the market (Fig. 1).
Prior work to move away from the digital trend and return to
low-power analog subthreshold systems has either solved only
a small part of the problem [3] or not aimed at the application
of cochlear implants but at modeling the basilar membrane
[4]-[6].

Fig. 1. Illustration of a digital-processor-based state-of-the-art cochlear
implant system.

By adopting the best of both the digital and analog worlds,
a complete system, with sufficiently low power consumption to
be totally implanted, is presented; digital circuitry is used for
robust communication with the implant, primarily for control
purposes, while low-power analog circuits are used for the signal
processing.

A totally implantable system is desired by manufacturers and 
patients alike for the following reasons. 

Improved Aesthetics: A totally concealed cochlear prosthesis
can bring significant improvements in self-confidence and
third-party attitudes, as has been witnessed with “in-the-canal”
hearing aids. Blending in within mainstream educational
institutions becomes significantly easier for children.

Reduction of Practical Limitations: A totally implanted 
system will allow the recipients to engage in activities they 
were otherwise unable to do while maintaining hearing, e.g., 
swimming, water-skiing, windsurfing, etc. 

Improved Perception: By having the microphone implanted 
in the canal, the patient can make use of the directional amplifi- 
cation provided by the external pinna, while also reducing noise 
from wind, an effect observed from “in-the-canal” hearing aids. 
The removal of the data rate restriction between the implanted 
part and the external processor allows the use of a higher tem- 
poral resolution, without compromising the number of active 
channels. The positive effect on patient speech recognition of 
increased temporal and frequency resolution is well known [7]. 


0018-9200/$20.00 © 2005 IEEE 










Fig. 2. Block diagram of described cochlear implant system. Labeled blocks:
microphone; stimulating electrodes; analog signal processing (ASP) circuits;
patient fitting and stimulation circuits; data recovery circuits; power
recovery, battery charge, and power management circuits; implant programmer
or charger.


II. SYSTEM OVERVIEW


The system consists of a single chip that combines the audio 
processing/stimulation circuits, a rechargeable battery, and a 
second chip containing power management and charging cir- 
cuits. All system components are encapsulated in a hermetically 
sealed platinum case for biocompatibility reasons. A block di- 
agram of the complete system is shown in Fig. 2. Power and 
system settings from the outside world are transferred to the im- 
plant via an inductive link, using a PWM scheme by means of 
an implant programmer or charger. 

The viability of such a prosthesis can be attributed to novel 
electrode designs that reach closer to the auditory nerve endings 
in the cochlea; thus, the overwhelmingly power-hungry stimuli 
of the past have been reduced to consume power comparable to, 
or less than, that used by the speech processor. In addition, the 
sufficient maturing of the cochlear implant speech processing 
algorithms has made the complete reprogrammability of DSPs 
unnecessary. 

This paper will only detail the components of the audio
processing/stimulation chip. This chip (diagonal cross-hatching)
has been manufactured in a 0.8-µm (5 V) process with direct
portability to a 0.8-µm, high-voltage (5 and 20 V) process. This
option is necessary because the upper voltage needed for
stimulation is reviewed as electrode technology develops; the upper
voltage is simply a function of the maximum comfortable
stimulation current and the maximum cochlear-electrode impedance
at this current. The impedance can be influenced by how close
the electrodes get to the neurons, by the materials used, and
by the surface area. These factors determine the upper voltage,
and can only finally be determined after clinical trials of novel
electrodes.

Fig. 3. Analog functions in a single channel.


III. ANALOG SIGNAL PROCESSING 
A. Underlying Technology 


Given the constraint that the voltage of the system is to be kept
at no less than 4 V for stimulation purposes, reducing power
implies reducing current levels. Hence, the system was designed
to operate predominantly in the subthreshold region (FET
technology was necessary for high integration density of the
digital control and trimming circuits). In the past, the subthreshold
region has been avoided, as device models were poor, and
device matching even poorer [8]; the EKV and the BSIM (v3.3
onwards) models can currently cope quite well with the
continuous modeling of all the FET operating regions. Similarly, as
the feature sizes have been reduced, the quality of the gate oxide
has improved such that, per unit gate area, matching has
also improved [9]. In terms of dynamic range, the subthreshold
region can usually provide around 60 dB if carefully designed.


B. Stimulation Strategy 


As analog systems are not as easily reconfigurable as digital
systems, the choice of processor stimulation strategy is critical
in making a successful implant system. Various studies have
shown that fast continuous interleaved sampling (CIS) strategies
provide better results in comparison to other strategies [7],
[10], [11], especially those that attempted to preprocess speech
and extract particular characteristics to present to the brain.

Fig. 4. An off-chip RC low-pass filter creates a high-pass function when
driving a differential input and also produces good single-to-double-ended
conversion. $X_s$ is the output impedance of the electret microphone, which
is approximately 4 kΩ.


C. Analog Signal Processor Overview 


The analog signal processor consists of two sets of eight par- 
allel channels, whose center frequencies are logarithmically dis- 
tributed in a fashion similar to that employed in the natural 
cochlea. Fig. 3 shows the constituent functions through a single 
channel. The microphone’s output is fed into a voltage-to-current
converter (shared by a set of eight channels), which feeds
the current-mode bandpass filter. The bias current of the
voltage-to-current converter is also used to adjust the input
sensitivity of the system. An automatic gain control (AGC) circuit regulates a
current so as to fit the largest audio signals into the 50-dB worst- 
case dynamic range of the filters. Each filter has an ultra-low- 
power clipping detector that consumes a maximum of 80 nW, 
given a 4-V supply. Each clipping detector output is fed to a 
common AGC circuit that reduces the voltage-to-current gain if 
clipping occurs in any of the channels. The attack and release 
times of the AGC are programmable and, if required, the AGC 
can be turned off when manual settings are preferred. A com- 
bined current-limiter/full-wave rectifier function block is placed 
in each channel after the filter. The current limiter is necessary 
in order to cut off large transients that may grow faster than 
the AGC’s response time, hence, protecting the patient from 
uncomfortably large stimulation current pulses. The full wave 
rectifier is necessary for extracting the power in a particular 
audio band. Finally, a combined low-pass filter/compressor/cur- 
rent amplifier stage smoothes out the fully rectified signal and 
compresses it, such that uniform increments in sound levels 
are perceived accordingly by the patient, while also amplifying 
the signal from nanoampere current levels to the microampere 
levels needed for electrical stimulation. 

The signal of each channel is then passed on to the patient 
fitting and stimulation circuits, which maximize a particular 
patient’s comfort and hearing ability, and ensure that only 
one channel’s signal is stimulating neurons at any one time 


according to the CIS strategy. Considerable power savings 
have been achieved by merging blocks and by using inherent 
functions provided by analog components. This will become 
more apparent when the individual circuits are presented. 


D. Input Stage 


1) Circuit Description: The electret microphone deemed
suitable for this application can roughly be modeled as shown in
the left half of Fig. 4, with $X_s$ being the series output impedance
of the microphone; the ac audio signal is superimposed on a
0.5-V dc signal. An off-chip RC low-pass filter is used to bias
the differential input to the system and, in conjunction with
the differential input, to create a high-pass filter that
attenuates 50/60-Hz mains pickup.

The voltage-to-current converter (Fig. 5) allows the
transconductance to be tuned while keeping the output dc current
level identical to the filter bias currents. Variations in the
microphone’s dc output level are easily tolerated with this
circuit. Large device areas have been used in order to bring the
flicker noise levels down sufficiently, and to provide reasonable
matching; (1) and (2) model the drain current standard deviation
and the flicker noise power, respectively.


$$\sigma\!\left(\frac{\Delta I_D}{I_D}\right) = \frac{A_{I_D}}{\sqrt{WL}} \qquad (1)$$

where $A_{I_D}$ is an empirical constant supplied for various values
of overdrive voltage, i.e., $V_G - V_{T0}$.


Ky Af 1 


f 


WL 
where K and p are process-dependent constants, WL is the active
area of the device, and Δf and f are the bandwidth and
frequency, respectively.

The transconducting FETs’ aspect ratios were kept such that
they were well within the subthreshold region to maximize the
efficiency, i.e., the $g_m/I_D$ ratio. The aspect ratio of the current
mirrors was lowered so as to improve current matching for a given
current, i.e., by minimizing the coefficient $A_{I_D}$ in (1). The
device sizes are shown in Table I.


2 


Ticker (2) 
Fig. 5. Input transconductor.


TABLE I
DEVICE SIZES FOR INPUT STAGE

Device     W     L
M1-M4      80    60
M5-M6      240   5
M7-M9      50    120
M10-M22    10    10


The system has two independent front-end transconductors, 
each driving a bank of eight logarithmically spaced log domain 
filters. The system was split into the two different bias schemes 
as a method of pre-emphasis and dynamic range extension of 
the higher frequency bands, which have a relatively low energy 
content in speech. The frequency divide between vowels and 
consonants is generally found to be at about 1.2 kHz [7]. The 
higher bias current of the upper filter bank and its independent 
AGC allow for a better signal-to-noise ratio of the higher fre- 
quencies. 

The input stage can be the most power-hungry part of the
analog signal processing blocks, depending on the settings of the
AGC. The current $I_{gain}$ (see Fig. 5) that controls the
transconductance varies from 10 to 200 nA.

2) Circuit Performance: At any instant, the dynamic range
of the input stage, i.e., between the largest signal that will
saturate the following filter and the input stage’s noise floor, is on
average 45 dB. This “capture window” is moved up or down
with use of the AGC circuits, which can shift it over 30 dB for
the lower eight frequencies and 14 dB for the top eight
frequencies. So the covered audio range for the upper eight frequencies
is about 59 dB, while the covered audio range for the lower eight
frequencies is about 75 dB. The worst-case total harmonic
distortion (THD) figure was measured to be 3.8%, with the input
stage at its minimum bias and an input signal corresponding
to 91 dB SPL, which is very loud. More detailed results on
the input stage’s THD can be found in [12]. Monte Carlo
simulations predicted that yields of over 99.7% should be
expected, though actual circuit measurements showed the
Monte Carlo simulations to be more pessimistic than necessary.


E. Filters 


1) Filter Design Description: A fully differential scheme
for the filters was avoided, as subthreshold device matching in
the 0.8-µm technology used was not sufficiently good to justify
such a scheme. However, special care was taken to ensure
substrate noise generated by digital circuits was sufficiently low
and appropriately isolated; the ground guard separating digital
and analog circuitry was 700 µm wide and had four bond wires
attached to provide a low-impedance path for stray substrate
noise. This solution was chosen instead of a twin-chip solution
requiring chip-to-chip bonding, as the latter would be spatially
wasteful within the package and was likely to reduce reliability.
The extra silicon area used is not particularly important, as
production numbers are low and the chip costs are negligible in
comparison to the complete product costs.

The filter used in this system is a derivative of one of the
early log-domain filters [13]; log-domain filters [13]-[17] are
linear when examined at a top level; however, no attempt is
made to linearize the internal building blocks¹. Benefits of
such methodologies are that the circuits are not limited to
small-signal operation; in addition, they generally have fewer
constituent elements and can push a particular technology
further in terms of frequency and lower voltage. When log-domain
filters were originally conceived, they were designed with
high-frequency/high-power operation in mind. However, with
the exploitation of the subthreshold exponential characteristics
[15], [19] for low power, these techniques were applied
to audio frequency applications where device bandwidths in
subthreshold can be an issue. Nevertheless, problems have
been found in low-frequency weak-inversion implementations
of log-domain filters [20], [21]; these problems are related
to the presence of multiple dc operating points that are not
present in the bipolar versions, ultimately because the bipolar
devices have a smaller “triode” operating region. A method
for eliminating the unwanted operating points with marginal
additional power consumption was developed by the authors
to overcome this problem [22] for the cochlea prosthesis. The
circuit diagram of the single operating point implementation
is shown in Fig. 6. A signal is input to the filter via device
$M_3$. The current-mode signal is transformed into the voltage
log-domain via device $M_{12}$. (All the remaining current sources
are biased with a constant current.) At nodes $v_1$ and $v_2$, the
nonlinear positive output conductance of the E+ cells is
cancelled or reduced by the nonlinear negative output conductance
of the E− cells. The Q of the filter is therefore controlled by
the bias current of $M_2$, which regulates the output conductance
of the input E+ cell and provides for damping. The filtered output
is provided by device $M_{27}$, which expands the level-shifted
voltage $v_1$ back into the linear current domain. The state
elimination circuitry keeps the associated pMOS device off during
normal operation, during which there is one $V_{GS}$ drop across the
positive and negative input terminals of the comparator. The
negative voltage terminal exceeds that of the positive when
approaching the unwanted quiescent operating point, so current is
sunk into node $v_2$ to keep operating in the desired region.

¹The most basic building block element is the exponential, inherent from the
voltage to current characteristics of bipolar transistors.

Fig. 6. Filter circuit schematic. All devices are sized 10 µm × 10 µm. All
current values are set at $I_0$, except those of $M_2$ and $M_1$, which control
the filter’s Q and which are set at $I_0 + I_0/Q$. The input is put through
$M_3$ and the bandpass output taken via $M_{27}$.

Fig. 7. Phase portrait of the original subthreshold filter’s operating states
[20], [21]. The desired dc operating point is $Q_1$. The undesired stable dc
operating point is $Q_3$.


Fig. 8. Phase portrait of the corrected subthreshold filter’s operating states
(axes: $V_{C1}$ and $V_{C2}$). There is only one operating point, i.e., $Q_1$.


The circuit’s operating points are found by replacing capacitors
C1 and C2 with voltage sources V1 and V2, while sweeping
every combination of voltages, during which currents I1 and I2
flowing through the voltage sources V1 and V2 are monitored.
By using the contour function in MATLAB, one can plot the
extrapolated zero current contour for I1 = 0 and for I2 = 0.
Where the two contours meet is an indication of a dc operating
point, since when there is no current driven into the capacitors,
the voltages of each capacitor will remain at the same level.
The direction of the state space trajectories can be obtained by
using the quiver command whose components are determined
by I1 and I2. Fig. 7 shows the filter’s operating points without
the state elimination circuitry. Fig. 8 shows the single operating 
TABLE II
SIMULATED FILTERS THD AT 50% AND AT 95% MODULATION INDEX

Ibias (nA)   Filter no.   Frequency (Hz)   Modulation index   Simulated THD (%)
10           1            300              50%                0.24
10           1            300              95%                a7
10           8            1250             50%                0.9
10           8            1250             95%                3.9
50           9            1580             50%                0.67
50           9            1580             95%                4.4
50           16           6450             50%                eS
50           16           6450             95%                4.9
TABLE III
MEASURED COMBINED INPUT-STAGE AND FILTERS THD
AT 90% MODULATION INDEX

Ibias   Ilin   Filter   F0/Hz   THD
10      10     2        365     3.8%
10      10     4        565     3.9%
10      10     8        1260    3.8%
50      50     9        1580    4.2%
50      50     12       2810    4.5%
50      50     16       6300    6%





point achieved once the elimination circuitry is added. More de- 
tails concerning the filter and its stability analysis can be found 
in [12]. 
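
The operating-point search described above can be sketched in a few lines. This is a minimal illustration only, assuming a hypothetical two-state toy system (`toy_i1`, `toy_i2`) in place of the simulated filter currents I1 and I2; the function name `find_operating_points` and the merge radius are likewise assumptions. It sweeps both voltages on a grid, flags cells where both currents change sign (the two zero-current contours pass through the same cell), and merges neighbouring candidates.

```python
import math

def find_operating_points(i1, i2, lo, hi, n, merge_radius):
    """Sweep (V1, V2) on an n x n cell grid and return merged candidate
    operating points where both zero-current contours share a cell."""
    step = (hi - lo) / n
    grid = [lo + k * step for k in range(n + 1)]
    # Currents through the two voltage sources at every grid node (the sweep).
    cur1 = [[i1(v1, v2) for v2 in grid] for v1 in grid]
    cur2 = [[i2(v1, v2) for v2 in grid] for v1 in grid]

    def crosses_zero(cur, a, b):
        corners = (cur[a][b], cur[a + 1][b], cur[a][b + 1], cur[a + 1][b + 1])
        return min(corners) <= 0.0 <= max(corners)

    clusters = []  # each entry: [sum of v1, sum of v2, number of cells]
    for a in range(n):
        for b in range(n):
            if not (crosses_zero(cur1, a, b) and crosses_zero(cur2, a, b)):
                continue
            c1, c2 = grid[a] + step / 2, grid[b] + step / 2
            for cl in clusters:
                if math.hypot(c1 - cl[0] / cl[2], c2 - cl[1] / cl[2]) <= merge_radius:
                    cl[0] += c1
                    cl[1] += c2
                    cl[2] += 1
                    break
            else:
                clusters.append([c1, c2, 1])
    return [(s1 / k, s2 / k) for s1, s2, k in clusters]

# Hypothetical toy currents with equilibria at (-1,-1), (0,0) and (1,1).
def toy_i1(v1, v2):
    return v2 - v1

def toy_i2(v1, v2):
    return 2.0 * v1 - v1 ** 3 - v2

if __name__ == "__main__":
    for point in find_operating_points(toy_i1, toy_i2, -1.5, 1.5, 61, 0.5):
        print("candidate operating point near", point)
```

The trajectory direction at each grid node follows from the same data, since dV1/dt and dV2/dt are proportional to I1/C1 and I2/C2; these are the components passed to the quiver plot.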

The circuit was implemented using solely pMOS devices 
(apart from a few nMOS bias current sinks that do not require 
exponential operation) for a number of reasons. First, the 
fact that the pMOS devices have their own well means that
the bulk-source voltage can be set to zero, thus simplifying 
the weak-inversion drain-source current expression. The well 
also provides some protection against substrate noise. Second, 
pMOS devices are less noisy and have better current matching 
than their nMOS counterparts in the technology used. 

The filter’s center frequency f0 is given by

$$f_0 = \frac{I_0}{2\pi n \phi_t C} \tag{3}$$

where n is the subthreshold parameter and φt is the thermal
voltage. This expression gives the designer the choice of determining
the center frequencies using the bias current I0 or the capacitance
C. As the weak inversion region has a somewhat limited
dynamic range, the logarithmic spacing of the eight filters in
each of the two filterbanks was determined by adjusting the ca- 
pacitor sizes and keeping the bias current consistent so as to keep 
the devices in the optimal subthreshold operating point, hence, 
keeping signals well above the noise/leakage levels and below 
the moderate inversion region. Post-implant fine-tuning, if required,
can be achieved by adjusting I0 so as to maximize the
hearing benefit in cases where the electrodes are inadequately 
inserted due to ossification of the cochlea. 
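
As a rough illustration of how (3) turns a set of logarithmically spaced center frequencies into capacitor values at a fixed bias current, consider the sketch below. The subthreshold parameter n = 1.4 and the thermal voltage of 25.85 mV are assumed values, not figures from the design, and the function names are hypothetical.

```python
import math

def center_frequency(i0, c, n=1.4, phi_t=0.02585):
    """Eq. (3): f0 = I0 / (2*pi*n*phi_t*C). n and phi_t are assumed values."""
    return i0 / (2.0 * math.pi * n * phi_t * c)

def capacitance_for(f0, i0, n=1.4, phi_t=0.02585):
    """Eq. (3) solved for C at a fixed bias current."""
    return i0 / (2.0 * math.pi * n * phi_t * f0)

def log_spaced(f_low, f_high, count):
    """Logarithmically spaced frequencies from f_low to f_high inclusive."""
    ratio = (f_high / f_low) ** (1.0 / (count - 1))
    return [f_low * ratio ** k for k in range(count)]

if __name__ == "__main__":
    i0 = 10e-9  # 10 nA bias, as used for the lower filterbank
    for f0 in log_spaced(300.0, 1250.0, 8):
        c = capacitance_for(f0, i0)
        print(f"f0 = {f0:7.1f} Hz  ->  C = {c * 1e12:6.1f} pF")
```

Because f0 is proportional to I0, the same function also shows the effect of post-implant tuning: scaling the bias current scales every center frequency in a bank by the same factor.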

2) Filter Performance: The aim of the two front-end 
transconductors and the eight filters connected to each of them 
is to separate the audio signal into its logarithmically distributed 
frequency bands. Distortion introduced after this separation has 
been conducted is insignificant, so long as the power content in 


TABLE IV
SIMULATED FILTER INPUT-REFERRED NOISE AND DYNAMIC RANGE

Ibias (nA)   Frequency   Filter   Input-referred in-band rms noise   DR
10           300 Hz      1        2.8 pA                             57 dB
10           1.25 kHz    8        5.6 pA                             51 dB
50           1.58 kHz    9        13.3 pA                            58 dB
50           6.45 kHz    16       28 pA                              51 dB
Fig. 9. Frequency response of the ninth filter. The measurement was made
using custom-made V–I and I–V converters. Mains noise harmonics are an
additional problem in making such measurements.


the band remains the same. Hence, it makes sense to provide 
measured performance characteristics of the input stage and 
filter working together. Table II and Table III provide simulated 
and measured composite THD figures at the worst-case input 
bias situation spanning across both filter banks. These figures 
are typical of what can be measured for an audio input at 
around 91 dB SPL. With smaller sound signals, hence higher 
Igain bias current, the input stage’s linearity improves. It should 
be noted that measuring such small signals in current mode is 
not straightforward, as commercially available instruments are 
voltage mode. Table IV provides simulated THD results for 
just the filters. In evaluating these figures for the application, 
it is important to have in mind the relative crudeness of the 
electrodes. A significant amount of current spread is inevitable 
since the electrodes are bathing in an ionic fluid. Effectively, 
this causes some electrodes to activate neurons that are meant 
to be stimulated by their neighboring one. 

Depending on the particular filter, the simulated dynamic 
range varies between 51 and 58 dB. In practice (see Fig. 9) 
verifying these to be the filter’s actual dynamic range was 
difficult, as the custom-made I–V converter at the output was
not suitable for measuring ac signals below 100 pA. Given that 
in Fig. 9 the maximum signal was around 10 nA, the minimum 
signal detected is around 100 pA. In addition, mains noise and 
harmonics were difficult to eliminate when measuring such 
small signal levels. 
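
As a quick check on these numbers, the span between the largest and smallest signals visible in Fig. 9 can be expressed in decibels. This is only the range that the measurement setup could resolve, not the dynamic range of the filter itself:

$$20\log_{10}\!\left(\frac{10\ \text{nA}}{100\ \text{pA}}\right) = 20\log_{10}(100) = 40\ \text{dB}$$

This is below the simulated 51 to 58 dB, which is consistent with the measurement being limited by the output converter rather than by the filter.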

Clipping detection is achieved by making an extra copy of 
the output current and sinking it into a current source of twice 
the bias current. If the output exceeds this value, the current that 
Fig. 11. A large input signal illustrates the current limiting feature of the current-mode ac signal full-wave rectifier.


the source cannot sink will drive the node to the positive supply 
level, hence sending a signal to the AGC circuit to reduce gain 
if possible. The worst-case peak power consumption is of the 
order of 80 nW for a 4-V supply while it is nominally 40 nW 
with no input signal. 


F. Current Limiter/Full-Wave Rectifier


The next two blocks in the signal processing chain shown in 
Fig. 3 will be dealt with together. The bandpass filter’s output 
signal is Class A, i.e., an ac signal mounted on a dc bias. Since 
the filters are single-ended structures, in order to perform full- 
wave rectification it is necessary that both phases are recovered. 
Fig. 10 shows a schematic of the circuit which produces a full-wave
rectified copy of the filtered signal, and hard limits the
output current to twice the value of the bias current. The
output current Ibias + Iac from the filter is sourced to a node A,
which is also connected to the input of current mirror (M0–M1)
and to a current source drawing 2Ibias. From KCL it is apparent
that the current drawn through the current mirror is Ibias − Iac.
The maximum current that can be drawn through this branch
is limited to 2Ibias, given that the output of the filter can only
provide unidirectional current. Similarly, the current Ibias − Iac
Fig. 12. Measured peak rectified output current versus peak ac input current
at 7 kHz.


is then sourced to node B, which is connected to a current
source drawing 2Ibias and to the input of current mirror
(M2–M3), which once again draws Ibias + Iac. The copied current
is passed to node C, where a current source of value Ibias
is connected in parallel with a current mirror (M5–Msg). The
mirror naturally only mirrors the positive phases of −Iac. In
other words, half-wave rectification of the current −Iac is carried
out. A similar operation is carried out at node D giving the
half-wave rectified positive phases of +Iac. By summing the
two half-wave rectified signals at node E, the full-wave rectified 
audio signal is thus recovered. 
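
The signal flow through nodes A to E can be followed with a small behavioural model. This is a sketch under simplifying assumptions (ideal unity-gain mirrors that conduct in one direction only, no mismatch, no frequency dependence); the function name `rectify` is hypothetical, and the node labels follow the description above.

```python
def rectify(i_ac, i_bias):
    """Idealized current limiter and full-wave rectifier.

    Every branch carries unidirectional current only, which is what
    produces both the limiting and the rectification.
    """
    # Class A filter output: it can source current but never sink it.
    filter_out = max(0.0, i_bias + i_ac)
    # Node A: a 2*Ibias sink; the mirror supplies the difference, Ibias - Iac,
    # bounded to the range 0 .. 2*Ibias.
    branch_a = min(max(2.0 * i_bias - filter_out, 0.0), 2.0 * i_bias)
    # Node B: same operation again, recovering the limited Ibias + Iac.
    branch_b = max(2.0 * i_bias - branch_a, 0.0)
    # Nodes C and D: subtract Ibias; the mirrors pass positive excess only.
    half_neg = max(branch_a - i_bias, 0.0)  # positive phases of -Iac
    half_pos = max(branch_b - i_bias, 0.0)  # positive phases of +Iac
    # Node E: sum of the two half-wave rectified currents.
    return half_neg + half_pos

if __name__ == "__main__":
    bias = 10e-9
    for i_ac in (-25e-9, -3e-9, 0.0, 3e-9, 25e-9):
        print(f"Iac = {i_ac * 1e9:6.1f} nA -> out = {rectify(i_ac, bias) * 1e9:5.1f} nA")
```

In this model the limited copy of Ibias + Iac never exceeds 2Ibias, so the rectified output saturates at Ibias for large inputs of either polarity.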

There are a number of device sizing issues associated with the 
full-wave rectifier and current limiter. On one hand, keeping de- 
vices small will save area and allow better rectification of small 
signals at higher frequencies since there is less charge stored 
in the channel, while on the other hand, if the devices are too 
small, this will lead to an unacceptably large mismatch. Mismatch
can lead to a dc output in the absence of an ac signal, or
alternatively, a minimum level below which nothing is detected. 
The devices that should be kept small, if small signals are to be 
adequately rectified at the higher frequencies, are M;—Mg and 
Fig. 13. This simple circuit performs low-pass filtering of the rectified signal,
compression, and current amplification from nA to μA.


Mo–Mjp». Of course, the Ibias currents should be well matched,
as well as devices Mo–Mg4. Fig. 12 illustrates the linearity of
the rectifier over a couple of decades at 7 kHz. Performance im- 
proves at much lower frequencies than 7 kHz, which is where 
the filters are operating. 


G. Low-Pass Filtering/Compression Amplification 


In the conventional implementation of the CIS strategy, after 
the full-wave rectification comes the low-pass filter, which is 
followed by a separate signal compressor that reduces the dy- 
namic range to fit the patient’s low stimulation range. Classical 
techniques for low-pass filtering at the low frequencies of in- 
terest either consume much area due to large capacitors/resis- 
tors or consume more power when using active components to 
make small capacitors “appear large” via the Miller effect. The 
proposed solution is shown in Fig. 13. 

If we exclude the current mirror (M1–M2) that is used for
interfacing to the full-wave rectifier, with just three transistors 
and a capacitor, the full-wave rectified signal can be smoothed 
out, compressed, and amplified to stimulation levels. The max- 
imum current input from the current-limited full-wave rectifier 
is 10 nA, which flows in all devices except for M5, which boosts
the signal to just under a microampere at maximum. Hence, this 
provides three signal-processing functions at a very low power 
budget. Inaccuracies due to process variations are not impor- 
tant since the patient-to-patient variations are much larger and 
are accommodated in the patient-fitting circuits that follow. In 
Fig. 14, the ac response at a bias of 10 nA illustrates a low cut-off 
frequency of just over 300 Hz with a 40-pF capacitor. The bias 
current supplied to this block is the full-wave rectified signal 
provided from the previous stage and so the cutoff frequency 
fluctuates to lower values accordingly. For example, at 500-pA
current, the cutoff frequency falls to 40 Hz.

The measured dc response of the circuit is shown in Fig. 15, il- 
lustrating both dynamic-range reduction as well as current gain, 
taking the signal from the nanoampere range to microampere 
levels. Dynamic range compression means this gain is higher 
for smaller signals and lower for larger signals. By sizing the 
devices appropriately, during normal operation all transistors 
(10 μm × 10 μm) are in the subthreshold region except for M5
(2 μm × 40 μm). Therefore, it is quite straightforward to derive
an expression describing the circuit’s dc input-output character- 
istic (neglecting the body effect): 


oe UCoz Ws ew Lis - 
Bey = Ls 2nVr In Waza 7 Pat) Vrx : (4) 
Ls 4 oO 


; 


A number of different compression schemes are utilized in 
cochlear implant processors; it is not imperative that these are 
logarithmic, so it does not matter if the circuit of Fig. 13 does not 
perform a purely logarithmic compression. A more generalized 
form of compression [23] used in cochlear implants is

$$I_{out} = A x^{p} + K. \tag{5}$$
That concludes the last of the analog signal-processing functions.
The compressed power in each frequency band is then sent
to the patient-fitting and stimulation circuits.
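
A power-law compression of the form in (5) can be fitted to a patient's range by choosing A and K so that the smallest input maps to the hearing threshold and the largest input to the maximum comfortable level. The sketch below assumes this endpoint-matching rule and an exponent p = 0.2 purely for illustration; neither is taken from the circuit of Fig. 13, and all names and numeric values are hypothetical.

```python
def make_compressor(x_min, x_max, thr, mcl, p):
    """Return y = A*x**p + K with y(x_min) = thr and y(x_max) = mcl."""
    a = (mcl - thr) / (x_max ** p - x_min ** p)
    k = thr - a * x_min ** p

    def compress(x):
        x = min(max(x, x_min), x_max)  # clamp to the fitted input range
        return a * x ** p + k

    return compress

if __name__ == "__main__":
    # Hypothetical fit: 100 pA..10 nA envelope mapped onto 100..600 uA.
    compress = make_compressor(100e-12, 10e-9, 100e-6, 600e-6, p=0.2)
    for x in (100e-12, 1e-9, 10e-9):
        print(f"in = {x * 1e9:5.2f} nA -> out = {compress(x) * 1e6:6.1f} uA")
```

With p below 1 the slope is steeper for small inputs than for large ones, which is the behaviour described for the measured dc response.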
Fig. 16. Schematic of electrode driving circuits. Device Mg has a smaller aspect ratio Mg. Devices M7 and Mj; are used to generate a bias voltage for the cascode devices M,¢g–Mz2; to increase output impedance without losing too much voltage headroom.


IV. PATIENT-FITTING AND STIMULATION CIRCUITS 


A. Patient-Fitting Circuits 


The stimulation current levels required for a patient to just
about perceive sound (stimulation threshold) vary quite significantly
from patient to patient or even from electrode site to
electrode site within the same cochlea. This greatly depends on 
the number of surviving neurons and the proximity of the elec- 
trodes to the nerves. Similarly, the maximum comfortable stim- 
ulation level also has a large variability. However, in all cases, 
the dynamic range between the hearing threshold and the max- 
imum comfortable level is quite low, ranging typically from 6 to 
20 dB. Hence, good fitting is important if the most is to be made 
of the limited dynamic range of the patient. The fitting circuits 
consist of digitally controlled variable-width current mirrors, as 
shown in Fig. 16. For each channel, 5 bits are allocated to re- 
moving any accumulated offsets from all the ASP circuits, since 
MOS devices biased in the subthreshold operating region have 
poor current-matching characteristics in comparison to identical 
devices operated in strong inversion [12]. Another 5 bits are al- 
located to setting the threshold of hearing, while another 6 bits 
are allocated for a multiplicative constant that takes the max- 
imum allowable current level, leaving the full-wave rectifier and 
smoothing circuits, to the maximum comfortable stimulation 
level. 

The same offset removal mechanism can also be used to re- 
duce the sound window’s capture dynamic range in noisy envi- 
ronments. In the highly successful n-of-m stimulation strategy, 
only the n strongest frequency bands, of a total of m separate 
channels, actually stimulate the cochlea. Once the offset is re- 
moved, the threshold of hearing is set via a second dc current 
source. Any ac power detected is added to this to give hearing 
sensation. Since the maximum current is limited at an earlier 
stage, the gain at the output is programmed such that at max- 
imum input volume, the stimulus does not exceed the patients’ 
comfortable hearing levels for each particular frequency. 
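
The n-of-m selection mentioned above is easy to state in code. The following is a generic sketch of the strategy itself, not of how the chip's offset-removal circuit approximates it; the function name `n_of_m` is hypothetical.

```python
def n_of_m(envelopes, n):
    """Keep the n strongest of the m channel envelopes and zero the rest."""
    ranked = sorted(range(len(envelopes)),
                    key=lambda ch: envelopes[ch], reverse=True)
    keep = set(ranked[:n])
    return [e if ch in keep else 0.0 for ch, e in enumerate(envelopes)]

if __name__ == "__main__":
    print(n_of_m([1.0, 5.0, 3.0, 2.0], 2))
```

Raising the per-channel offset has a similar effect in noisy environments, since weak bands fall below the offset and contribute nothing above the hearing threshold.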


B. CIS Biphasic Pulse Generation 


The continuous interleaved sampling (CIS) generator is the 
last of the signal conditioning blocks that directly interfaces 
with the electrodes, via blocking capacitors. The CIS generator 
converts the output of the patient dynamic range mapping cir- 
cuits into nonoverlapping biphasic pulses. 

A top-level block diagram of the CIS generator is shown in 
Fig. 17. As there are 16 channels in the system, there are 16 
output driver cells making up the CIS generator; however, only
the first two and last two cells are shown. All the intermediate 
cells are identical, while the first and last cells differ slightly. 
The three different cells are shown in Fig. 18. Starting from the 
front cell, assuming there is no busy signal output from any of 
the following 15 cells or from within itself, the first cell will 
activate itself by generating a pulse with the three-input NOR gate
driving the first D-flip-flop input. A clock period later, the pulse 
will propagate to the output of the first flip-flop, which will turn 
on switches such as to provide a current path via M to electrode 
A, back through electrode B, and down Mz to ground. The first 
flip-flop is high so the three-input NOR gate will not produce 
another pulse as its output. After another clock cycle, the pulse 
will propagate on, flip-flop down, and reverse the direction of 
the current through the electrodes for another period. On the 
next clock pulse, a middle cell is activated and propagates the 
pulse in a similar fashion, first through itself and then down the 
remaining 13 cells, until it activates the last cell. This works in 
a similar fashion but has an extra flip-flop added to it so that it 
can provide an extra pulse that shorts all electrodes to ground, 
so as to remove any residual charge. This is required to make 
absolutely sure that no dc charge accumulates on the blocking 
capacitors, reducing voltage compliance. If blocking capacitors
are not used and dc charge accumulates on the electrodes, electrolysis
may occur, corroding the electrodes and producing toxic
materials, e.g., 2Cl⁻ ions could be turned into Cl₂ gas!
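
The pulse ordering described above can be summarised as a per-clock schedule. The sketch below is an idealization that ignores the one-clock start-up latency of the NOR gate and assumes exactly one clock period per phase; the function name and the phase labels are hypothetical.

```python
def cis_schedule(n_channels=16):
    """Return one CIS cycle as a list of (channel, action) per clock period."""
    ticks = []
    for ch in range(1, n_channels + 1):
        ticks.append((ch, "A->B"))   # first phase: current from electrode A to B
        ticks.append((ch, "B->A"))   # second phase: current direction reversed
    ticks.append((None, "short"))    # last cell: short all electrodes to ground
    return ticks

if __name__ == "__main__":
    for tick, (ch, action) in enumerate(cis_schedule(4)):
        print(tick, ch, action)
```

In this model only one channel is driven in any clock period, each channel receives equal time in both directions, and the cycle ends with the shorting pulse that removes residual charge.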

The clock used to drive the CIS generator is obtained from a 
simple RC relaxation oscillator shown in Fig. 19. This consists 
of three inverters, a capacitor, and a digitally controlled resistor
that is used to adjust the frequency of oscillation. The frequency
of oscillation directly affects the pulse width and the refresh rate
of each channel.

Fig. 17. Top-level view of the CIS generator circuit (only first two and last two channels).

Fig. 18. The first, middle, and last cells making up the CIS stimulation generator are shown. By using a modular design, any number of channels can be easily assembled.

Fig. 19. Simple RC oscillator used to drive biphasic pulse generator.

Fig. 20. Peaking current reference circuit.


V. AUXILIARY CIRCUITS


A. Power and Data Transfer


Power is sourced to the implant via an inductive link. The
same inductive link is used to convey digital data to set up the
system for a particular patient's needs. The electromagnetic wave
is pulse-width modulated in a similar fashion to that described 
in [24], which maintains a constant flow of power and data. A
detailed description of the data recovery circuits and system
setting can be found in [12], [25]. 


B. Reference Circuits 


The reference circuits in a totally implanted system do not 
have to be particularly accurate but should remain consistent be- 
tween the implant fitting/adjustment sessions so as not to over- 
stimulate or understimulate the patients’ neurons. Overstimula- 
tion results in exceeding the maximum comfortable level and 
can cause irreversible damage to surviving neurons, while un- 
derstimulation does not make use of the already limited neural 
interface dynamic range. The implant is subjected to virtually 
no temperature variations, while the current design’s power dissipation
is of the order of microwatts and so does not affect the
temperature within the casing. 

The circuit chosen for the cochlear implant current reference 
is shown in Fig. 20. It is a low-current implementation of 
the peaking current reference source [26], [27]. The circuit 
achieves some degree of supply insensitivity by current feedback.
Assuming Vdd rises, due to the finite output resistance of
Q2, IQ2 rises too. This increase in current is mirrored and driven
through resistor R. To accommodate the change in current,
VbeQ1 will increase logarithmically, while the voltage drop
across the resistor will increase linearly. This causes VbeQ2
to decrease, countering the initial increase in current, due to
increase in supply voltage. It is quite simple to show that if


(W/L), = (W/L), (6) 


then 


$$I_{out} = \frac{kT}{qR}\,\frac{\ln m}{m} \tag{7}$$


Letting m = 8 and R = 243 kΩ gives an output current of
roughly 27 nA. This simulates to about 30 nA due to secondary
effects (e.g., the vertical npn transistors have an Early
voltage of about 30 V). The diode connected to Q1 is normally 
reverse biased except in startup conditions. The bias voltage for 
the startup diode was generated using a string of diodes con- 
nected between the supplies. Enough of them were used so as 
to ensure very little power loss. The current ranges from 270 pA 
to 1 nA when the supply varies from 3.8 to 4.2 V. The total power 
of the circuit including the outputs stands at 1.8 μW at 4 V.

The 1.1-V reference voltage required for the microphone 
supply is made by forward biasing a couple of diodes with
around 10 nA of current to generate the voltage, and then
buffering it. An off-chip capacitor is used to reduce the noise
of this supply. Iout showed a 0.8% variation over
supply voltage, while Vref has a 0.04% variation over the supply
range. In terms of manufacturing variability, the circuit has 
an 11% standard deviation mainly due to resistor tolerances, 
but is digitally adjusted back to the nominal value. The digital 
adjustment is required in any case since inadequate electrode 
insertion requires the bias currents to be adjusted to compensate 
for this. 


VI. THE OVERALL SYSTEM 


The complete system fits on a die 3.5 mm x 6 mm; a photo- 
graph of the completed chip is shown in Fig. 21, along with a 
layout map. The top end of the chip contains the lowest power 
and noise components, while the noisiest and highest power 
circuit elements are placed on the bottom. A wide p+ guard
separates the predominantly digital circuitry from the analog 
circuitry. Six different supply pad pairs were used; for either 
half of the chip, a low-noise analog supply and separate dig- 
ital supply was used. The fifth supply was used for the substrate 
Fig. 21. Photograph of chip measuring 3.5 mm × 6 mm and layout plan.
biasing/guard ring network of the low noise upper half of the 
chip; as the largest area is taken up by interleaved capacitors 
pairs, these were individually shielded from substrate noise by 
placing an n− tub in the p-substrate beneath each one. This was
connected to the positive bias supply. The last pair of supply pins 
provides power to the settings registry. When the battery supply 
is low, all the other supplies are cut off to prevent complete dis- 
charge. The static power consumption of the registry circuits is 
extremely low, in the order of femtoamperes. In the event that 
the registry power is completely cut off, on power up the system 
resets the registry’s contents to ensure that the system comes up 
in a safe state, i.e., all outputs are set to zero. 

In order to aid the testing of the cochlear system-on-chip, a 
dedicated test board was constructed. On the test board, a PIC 
microcontroller was used to send Hamming PWM encoded sig- 
nals to control the settings on the chip. The biphasic current 
outputs of the stimulation circuits drove a resistor of similar 
impedance to that of a real cochlea, via a series blocking ca- 
pacitor. The voltage developed across the resistor was amplified 
and sent to one of the PIC’s A/D converters. As we can only 
monitor eight out of the 16 channels at any one time, the chan- 
nels were split into odd and even channels with the use of dip 
switches on the test board. Fig. 23 shows the PC interface used 
to provide the settings on the cochlea chip. At the bottom, the 
resulting spectrogram is created by a log audio sweep, ranging 
in frequency from 10 Hz to 10 kHz. The intensity represents 
the magnitude of the current output pulses above the patients’ 
threshold of hearing. The pulsating at the lower frequencies is 
due to the input signal frequency being comparable to that of the 
CIS output frequency. Table V contains a summary of the key 
features and performance characteristics. 

The total power of the chip was measured to be 126 μW at
4 V, not including the power dissipated by the biphasic pulse
stimulus. Assuming a battery capacity of 10 mAh at 4 V on one
charge, the circuit will be powered for about 13 days. Assuming
that, with the next-generation electrodes, the stimulus is on
average 500 μA in the constant presence of sound, the total
power will be 2.126 mW, so the same battery will last at least
Fig. 22. Illustration of biphasic pulses output from channels 2, 6, 10, 14 of the
whole system. The pulses were sent through a blocking capacitor and a resistor.


TABLE V
FEATURES AND PERFORMANCE SUMMARY

General Characteristics
Process                            AMS 0.8 μm CXZ
Die area                           3.5 mm × 6 mm
Number of channels                 2 × 8 (logarithmically distributed)
Power consumption                  126 μW (excl. electrode stimuli)
AGC time constants                 τattack < 5 ms, τrelease < 120 ms
(optional AGC and externally programmable τ's)

                                   Min            Max
Supply voltage                     3.8 V          4.2 V
Sound pressure level range         30 dB SPL      90 dB SPL
(noise free and clipping free)
Input dc voltage levels            0 V            Vdd − 500 mV

Analog Signal Processing Characteristics
Input stage dynamic range
AGC dynamic range
Filter dynamic range
Filter center freq. tuning
Smoothing filter max fc
Compression characteristic

Output Characteristics
Stimulation strategy
Stimulation
Hearing threshold resol.           5 bits (max 100 μA)
Max comf. hearing resol.           6 bits (max 600 μA)

                                   Min            Max
Stimulation current                0              700 μA
Pulse width                        50 μs/phase    100 μs/phase
(externally programmable to 4 levels)

18 hours and 48 mins which is quite reasonable, assuming that 
the implant is recharged during the patient’s sleep. 
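
The battery-life figures quoted above can be reproduced with a few lines of arithmetic. The sketch reads the stated battery capacity as a charge capacity of 10 mAh (an assumption, but one that reproduces the 13-day figure); the function name is hypothetical.

```python
def battery_life_hours(capacity_mah, supply_v, power_w):
    """Hours of operation from a battery of given capacity at a constant load."""
    energy_wh = capacity_mah / 1000.0 * supply_v
    return energy_wh / power_w

if __name__ == "__main__":
    chip_only = battery_life_hours(10.0, 4.0, 126e-6)
    # Adding an average 500 uA stimulus at 4 V gives 2 mW + 126 uW.
    with_stimulus = battery_life_hours(10.0, 4.0, 126e-6 + 500e-6 * 4.0)
    print(f"chip only:     {chip_only / 24.0:.1f} days")
    print(f"with stimulus: {with_stimulus:.1f} hours")
```

The second case is dominated by the stimulus current, so further reductions in processor power would change the runtime very little once stimulation is included.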


VII. DISCUSSION 


The trend toward complete digital systems on chip has been 
re-examined to find that a hybrid digital analog system can save 
much more power for this particular application. In the above- 
Fig. 23. Cochlear chip controller window interface. At the bottom is shown a spectrogram produced by a log frequency audio sweep between 10 Hz and 10 kHz.


described system, the power levels have been reduced from mil- 
liwatt levels that are currently used in cochlear chips to the mi- 
crowatt range. A proof-of-concept design has been shown in 
solid-state form; however, before taking a system like this into
production there is still much work to be done, e.g., in the areas 
of long-term reliability, patient safety through clinical trials, etc.
Nevertheless, the design is based on existing successful cochlear 
implant processing strategies and so patient performance results 
are not expected to differ significantly. 
A 375 × 365 High-Speed 3-D Range-Finding
Image Sensor Using Row-Parallel Search 
Architecture and Multisampling Technique 


Yusuke Oike, Student Member, IEEE, Makoto Ikeda, Member, IEEE, and Kunihiro Asada, Member, IEEE 


Abstract—A high-speed three-dimensional (3-D) image sensor 
for a 1000 range maps/s 3-D measurement system based on a light- 
section method is presented. It employs a row-parallel search ar- 
chitecture to achieve a high-speed frame access rate for the detec- 
tion of activated pixels on the focal plane. The row-parallel search 
operation is carried out using chained search circuits embedded in 
a pixel. Moreover, we propose a row-parallel address acquisition 
technique using a bit-streamed column address flow. Row-parallel 
processors receive the bit-streamed column address and calculate 
the center position of activated pixels. The pipelined operations 
enable a multisampling technique that improves the resolution of 
pixel detection. A 375 × 365 3-D image sensor using the present
architecture has been designed in a one-poly five-metal 0.18-μm
standard CMOS process and successfully tested. It attains a frame 
access rate of 394.5 kHz with four samplings, which corresponds 
to 1052 range maps/s. The multisampling operation improves the 
sub-pixel resolution to around 0.2 pixels and achieves a range ac- 
curacy of less than 1.10 mm at a target distance of 600 mm. 


Index Terms—CMOS image sensor, high range accuracy, high 
speed, light-section method, multisampling method, range finder, 
row parallel architecture, 3-D image sensor. 


I. INTRODUCTION 


A HIGH-SPEED and high-resolution three-dimensional
(3-D) imaging system has a wide variety of applications
including gesture recognition, depth-key object extraction, 
position adjustment, computer vision and security systems. 
In recent years, we have often seen 3-D computer graphics in 
movies and televisions and handled them interactively using 
personal computers and video game machines. Moreover, 
ultra-high-speed range finding provides the possibility of ad- 
ditional applications such as shape measurement of structural 
deformation and destruction, quick inspection of industrial 
components, observation of high-speed moving objects, and 
fast visual feedback systems in robot vision. 

Some 3-D range-finding image sensors have been presented 
for 3-D imaging applications based on the stereo-matching 
method [1], [2], the time-of-flight method [3]-[7], and the 
light-section method [8]-[13]. The stereo-matching method 
provides a simple system configuration with two or more cam- 
eras. The stereo-matching processing, however, requires a huge 
Fig. 1. Triangulation-based light-section range finding system. (a) System
configuration. (b) Relation between range accuracy and beam position on the 
focal plane. 


computational effort with a high pixel resolution, and the range 
resolution and accuracy depend on target surface patterns. It 
is also difficult for the time-of-flight method to provide high 
range accuracy due to the limitations on the phase detection 
speed of a pulsed light. On the other hand, the light-section 
method is capable of high-accuracy range finding and it is most 
suitable for precision shape analysis. A typical configuration 
of light-section range finding is shown in Fig. 1(a). A sheet
laser beam is projected and scanned on a target object. An
image sensor detects the positions of the reflected beam on
the sensor plane. 3-D range data are calculated by the beam
projection angle αp and the beam incidence angle αi based on
triangulation as shown in Fig. 1(b). The beam incidence angle 
can be acquired by the position of the incident beam on the 
sensor. Therefore many frames are necessary for a 3-D range 
image during the beam scanning. For example, a 1000 range 
maps/s 3-D measurement system with a practical pixel reso- 
lution requires over 100-kHz frame access rate. It is difficult 
for conventional image sensors to realize such a high-speed 
frame access. Fig. 2 plots the trend of range finding speed and 
pixel resolution in the state-of-the-art high-speed image sensors 
[14], [15] and light-section 3-D range finders [10]-[13]. It also 
shows examples of high-speed range finding applications. The 
conventional 3-D range finders have achieved 40-50 kHz frame 
access rate for real-time 3-D imaging. However, the target area 
of 1000 range maps/s requires around 400-kHz frame access 
rate. Therefore, we have presented a concept of a row-parallel 
search architecture on the focal plane and demonstrated the 
possibility of 1000 range maps/s range finding with a practical 
pixel resolution [16]. 
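
The triangulation step can be illustrated with the textbook relation for two rays leaving the ends of a known baseline. This is a sketch under stated assumptions: both angles are measured from the baseline joining the projector and the sensor, and the baseline length is a hypothetical parameter that does not appear in the text above.

```python
import math

def range_from_angles(baseline, alpha_p, alpha_i):
    """Perpendicular distance from the baseline to the target point.

    With the projector at x = 0 and the sensor at x = baseline,
    tan(alpha_p) = z / x and tan(alpha_i) = z / (baseline - x), so
    z = baseline * tan(alpha_p) * tan(alpha_i) / (tan(alpha_p) + tan(alpha_i)).
    Angles are in radians.
    """
    tp = math.tan(alpha_p)
    ti = math.tan(alpha_i)
    return baseline * tp * ti / (tp + ti)

if __name__ == "__main__":
    z = range_from_angles(2.0, math.radians(45.0), math.radians(45.0))
    print(f"range = {z:.3f} (same units as the baseline)")
```

Since the incidence angle is obtained from the beam position on the sensor, any error in the detected position translates directly into a range error, which is why sub-pixel position detection matters for range accuracy.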

This paper presents a 3-D image sensor with 375 × 365 pixels
for a 1000 range maps/s 3-D measurement system based on the 
light-section method, which was reported in part at the IEEE 
Fig. 3. Frame access methods. (a) Raster scan. (b) Row-access scan. (c) Row-parallel scan.


ISSCC 2004 [17]. The row-parallel search architecture is imple- 
mented in three pipelined stages with a new multisampling func- 
tion. The separated stages of photo integration, position detec- 
tion, and data readout enable a high-speed frame access rate with 
multiple samplings. The multisampling technique improves the 
sub-pixel resolution of position detection on the focal plane for 
high range accuracy. 

Section II presents the concept of a row-parallel search archi- 
tecture. Circuit configurations and operations are described in 
Section III. Section IV introduces the multisampling technique 
with theoretical estimation of the improved sub-pixel resolu- 
tion. Then, Section V shows the chip specification of a designed 
3-D image sensor. The measurement results are discussed in 
Section VI. Finally, Section VII concludes this paper. 
Row-parallel position detection architecture implemented on the sensor


II. ROW-PARALLEL POSITION DETECTION ARCHITECTURE 
A. Concept of Row-Parallel Search Architecture 


Conventional image sensors typically employ a raster scan
method or a row-access scan method. The raster scan method
accesses all the pixels sequentially for a few activated pixels
on the focal plane as shown in Fig. 3(a). The row-access scan
method also needs to access all the pixel values. In row-access
image sensors such as [11]-[13], the activated pixels in a row
line can be scanned and detected in a column parallel fashion
as shown in Fig. 3(b). Therefore, the row-access scan method is
more suitable for high-speed position detection than the raster
scan method. Fig. 4(a) shows the position detection flow of the
row-access scan method. First, some pixels are activated by a
strong incident beam. Then the pixel values in a row line are read
out. The activated pixels are scanned and detected in column
parallel. The left and right edge addresses of consecutively
activated pixels are acquired. If another incident beam exists in
the row line, the search and address encoding operations are
repeated. After that, the next row line is accessed and the pixel
values are read out again. The access and search operations are
repeated in proportion to the number of row lines. The access
rate, limited to about 50 kHz, becomes the bottleneck.

Fig. 6. Simplified block diagram of 4 × 4 pixels.

Fig. 3(c) shows the proposed row-parallel scan method on the 
focal plane. In the row-parallel scan method, activated pixels in 
every row line are simultaneously scanned in row parallel. Then 
the addresses are acquired also in row parallel. Therefore there is 
no access iteration in proportion to the pixel resolution as shown 
in Fig. 4(b). 


B. Block Diagram of Row-Parallel Scan Sensor 


The present row-parallel architecture is implemented on the 
sensor plane as shown in Fig. 5. The row-parallel search op- 
eration is carried out by a chained search circuit embedded in 
each pixel. Search signals are provided from the left part of the 
sensor. They propagate from one pixel to the next pixel one after
another via the in-pixel search circuit in a row parallel fashion. 
Then, the search propagation is interrupted at the first-encoun- 
tered active pixel in each row line. In terms of address acquisi- 
tion, it is impractical to implement an address encoder in every 
row line since a regularly spaced array structure is necessary for 
an image sensor. If a standard address encoder is implemented in 
each pixel, it requires many transverse wires per row as well as 
a large circuit area per pixel. We propose a bit-streamed column 
address flow for row-parallel address acquisition that enables 
compact circuit implementation. Column address streams are 
injected at the top part of the sensor in column parallel, and 
change their directions at pixels detected by the search circuits. 
The address acquisition scheme requires just one vertical wire 
per column and one transverse wire per row, which is suitable 
for a high-resolution pixel array. Each pixel includes a photo 
detector, a 1-bit A/D converter, a search circuit, and part of an 
address encoder. 

Fig. 6 shows an overview of the row-parallel scan image 
sensor simplified to 4 x 4 pixels. It consists of a pixel array, 
bit-streamed column address generators at the top, row-parallel 
processors with data registers and output buffers on the right, a 
row scanner on the left, and a multiplexer at the bottom. These 
components are controlled by an on-chip sensor controller with 
a phase-locked loop (PLL) module. Pixels in a row line are 
connected with neighbor pixels by a search signal path. Column 
address streams are provided from the address generators to 
each vertical wire. Then the bit-streamed address signals are
injected to horizontal wires at the detected pixels. The row-parallel
processors receive the bit-streamed address signals and
the search completion signals from the right pixels in each row.

OIKE et al.: RANGE-FINDING IMAGE SENSOR USING ROW-PARALLEL SEARCH ARCHITECTURE AND MULTISAMPLING TECHNIQUE 447

Fig. 7. Schematic of a pixel circuit.

Fig. 8. Timing diagram of a pixel circuit.


III. CIRCUIT CONFIGURATION AND OPERATION
A. Pixel Circuit Configuration 


Fig. 7 shows the pixel circuit configuration with row-parallel
position detection functions. It consists of a photo detector with
a reset circuit, a 1-bit A/D converter with a latch circuit, a pixel
value readout circuit, a search mode switch circuit, a chained
search circuit, and part of an address encoder. The voltage $V_{pd}$
is set to a reset voltage $V_{rst}$ by RST. The 1-bit A/D converter
receives $V_{pd}$ and determines the pixel value. The voltage $V_{pd}$
becomes a low level in case of an active pixel with strong incident
intensity. Therefore, it provides “0” for an active pixel value, and
“1” for an inactive pixel value. A transistor biased by $V_b$ reduces
the short-circuit current and controls the threshold level of A/D
conversion. The pixel value readout circuit provides a binary
image for functional tests. The search mode switch circuit and
the chained search circuit are devoted to a row-parallel search
for activated pixels. The address encoding section connects a
column address line with a row address line. The row-parallel
search and address acquisition functions are described in detail
in the next sections.


B. Row-Parallel Search Operation 


The row-parallel search operation is carried out using a 
chained search circuit embedded in each pixel. First, it detects 
the left edge of consecutively activated pixels in each row.
Fig. 8 shows a timing diagram of the pixel circuit. Fig. 9 shows
the procedure of the row-parallel search for activated pixels.
The search mode switch circuit, which is implemented by 
a pass-transistor XOR, provides a control signal CTR for the 
search circuit. For the left edge detection, LSW and RSW are 
set to a high level and a low level, respectively. As the result 
of pixel activation, the activated pixel values are “0” and the
others are “1” as shown in Fig. 9(a). A search signal $SCH_0$
is provided to the leftmost pixel in each row line. It passes through
inactive pixels one after another via the in-pixel search circuits
since the control signal CTR is set to a high level. The search
signal propagation is interrupted at the first-encountered active
pixel as shown in Fig. 9(b), that is, it detects the left edge 
of consecutively activated pixels. After row-parallel address 
acquisition, LSW turns OFF and RSW turns ON. All the pixel 
values are inverted for the right edge detection as shown in 
Fig. 9(c). Namely, the active pixel values change to “1” and the 
interrupted search signal immediately starts again from the left 
edge. It passes through active pixels one after another and then 
stops at the next pixel of the right edge. 

The worst delay of the search operation is the signal prop- 
agation delay through all the pixels in a row line. Therefore 
the search clock cycle is determined by the worst-case delay. 
The center position of incident beam can be calculated by the 
left and right edge addresses. The number of search cycles is 
the same regardless of the number of consecutively activated 
pixels. If another activated pixel exists on the same row, all the 
pixel values can be inverted again by switching LSW and RSW. 
The search operation restarts from the detected right edge to the 
next left edge. Therefore the row-parallel search operation is ca- 
pable of position detection for multiple incident beams due to 
the search continuation. The last search signal $SCH_n$ from the
rightmost pixel serves as a search completion signal, indicating
whether no activated pixel exists in each row.
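The search procedure described above can be sketched in software for a single row. This is only a behavioral model under stated assumptions, not the pixel circuit itself: the function name `find_beam_edges` is hypothetical, the row is a plain list using the polarity given in the text (0 for an active pixel, 1 for an inactive one), and swapping LSW and RSW is modeled as inverting every pixel value.

```python
from typing import List, Tuple


def find_beam_edges(row: List[int]) -> List[Tuple[int, int]]:
    """Return (left, right) column indices for each run of active pixels.

    Polarity follows the text: 0 = active pixel, 1 = inactive pixel.
    """
    edges = []
    values = list(row)   # pixel values as seen by the chained search circuits
    n = len(values)
    col = 0              # column at which the search signal is currently stopped
    while True:
        # Left-edge search: the signal passes through pixels whose value is 1.
        while col < n and values[col] == 1:
            col += 1
        if col == n:
            break        # the signal reached the right end: search completion
        left = col
        # Swapping LSW/RSW inverts all pixel values, so active pixels now pass.
        values = [1 - v for v in values]
        while col < n and values[col] == 1:
            col += 1
        right = col - 1  # the signal stops one pixel past the right edge
        edges.append((left, right))
        # Swap back so the search continues toward the next incident beam.
        values = [1 - v for v in values]
    return edges


row = [1, 1, 0, 0, 0, 1, 1, 0, 1]
edges = find_beam_edges(row)
centers = [(left + right) / 2 for left, right in edges]
```

Each beam costs the same number of search phases regardless of its width, which matches the statement that the number of search cycles does not depend on the number of consecutively activated pixels.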


C. Row-Parallel Address Acquisition 


Fig. 10 shows a bit-streamed column address flow for row- 
parallel address acquisition. A column address line is connected 
to a row address line by part of an address encoder in the de- 
tected pixel. The row-parallel address acquisition needs just 2 
pass transistors in a pixel as shown in Fig. 7. At the detected
left edge, $SCH_i$ from the previous pixel becomes a high level,
but the next search signal $SCH_{i+1}$ is still a low level since the
search signal propagation is interrupted. Therefore, both inputs,
$SCH_i$ and $\overline{SCH_{i+1}}$, are set to a high level at the
detected pixel. A bit-streamed address signal is then provided from
a column address line to a row address line via the two pass
transistors. The column address streams never conflict with each
other in the same row line since the left or right edge is detected
by the row-parallel search in each row. The bit-streamed address
signals are injected from the LSB to the MSB, and then they are
received by the row-parallel processors.
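A minimal sketch of the bit-streamed address flow follows. The function names are hypothetical, and the 9-bit address width is an assumption (the smallest width that covers 375 columns); the source does not state the stream length.

```python
def column_bitstream(column, width=9):
    """Bits of a column address in injection order: LSB first."""
    return [(column >> k) & 1 for k in range(width)]


def acquire_address(detected_column, n_columns=375, width=9):
    """Model one row: only the detected pixel connects column wire to row wire."""
    # Every column address generator drives its own vertical wire.
    vertical_wires = [column_bitstream(c, width) for c in range(n_columns)]
    # The detected pixel closes its two pass transistors, so the row wire
    # carries exactly one column's stream and no two streams collide.
    row_wire = vertical_wires[detected_column]
    # The row-parallel processor reassembles the address bit by bit.
    address = 0
    for k, bit in enumerate(row_wire):
        address |= bit << k
    return address
```

For example, `acquire_address(374)` models the worst-case column used later in the measurements and recovers the address 374 from its serial stream.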


D. Row-Parallel Processing 


The range-finding image sensor has row-parallel processors
that receive bit-streamed address signals $ADD_j$ and search
completion signals $SCH_{375}$ in each row. Fig. 11 shows a
schematic of the row-parallel processor. It consists of a selector
with a signal receiver, a full adder, 18-bit registers, 18-bit output
buffers, and data readout circuits. The selector switches between
two processing functions: an address acquisition mode and
an activation counting mode. Fig. 12 shows a timing diagram
of the row-parallel processor. A bit-streamed address signal is
received by a low-threshold inverter because the address signal
cannot swing to the supply voltage due to pass transistors in a
pixel. In a multisampling operation, the row-parallel processor
counts the number of usable pixel activations by the search
completion signal since an occasional search operation
includes no activated pixel. The address acquisition mode and the
activation counting mode are switched by MLT. The left edge
address is stored in the registers. Then the right edge address
is accumulated on the left edge address, under the control of two
clock signals, in sequential order from the LSB to the MSB. ENB is
employed to disable the input of the full adder for carry
accumulation in a multisampling operation. The accumulated address
represents the center position of activated pixels. The results are
transferred to the output buffers by TR, and then they are read
out by $SEL_j$ during the search operations for the next frame.
The row-parallel processing is executed concurrently with the
row-parallel address acquisition. The row-parallel processor
has the capability to perform a multisampling operation due to
the high-speed position detection.

Fig. 13. Sub-pixel center position detection by multisampling method.
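The accumulation of the right edge address onto the left edge address can be illustrated with a bit-serial adder. This is a sketch under assumptions: the helper names are hypothetical, the 9-bit stream width is assumed, and feeding zeros once the stream has ended only loosely mirrors the role the text gives to ENB.

```python
def to_bits(value, width):
    """LSB-first bit list of a non-negative integer."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    """Integer value of an LSB-first bit list."""
    return sum(bit << k for k, bit in enumerate(bits))


def accumulate(register_bits, stream_bits):
    """Bit-serial addition, LSB first, using one full adder and a carry bit."""
    carry = 0
    result = []
    for k, a in enumerate(register_bits):
        # Once the address stream has ended, only the carry keeps propagating.
        b = stream_bits[k] if k < len(stream_bits) else 0
        result.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return result


WIDTH = 18                              # register width given in the text
left, right = 120, 125                  # hypothetical edge addresses
register = to_bits(left, WIDTH)         # left edge address stored first
register = accumulate(register, to_bits(right, 9))
center = from_bits(register) / 2        # accumulated value is left + right
```

The register ends up holding `left + right`, that is, twice the center position, so a half-pixel step is available without any division on the focal plane.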


IV. MULTISAMPLING POSITION DETECTION 


Three-dimensional range data is calculated by the beam
projection angle $\alpha_p$ and the incident angle $\alpha_i$ as shown in Fig. 1(b).
The incident beam angle $\alpha_i$ is provided from the incident beam
position on the focal plane. Therefore, the range resolution and
accuracy depend on the resolution of position detection on the 
sensor. In other words, the sub-pixel resolution efficiently im- 
proves the range accuracy. A multisampling technique is imple- 
mented to acquire the intensity profile of incident beam for a 
fine sub-pixel resolution. 
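For orientation, a generic triangulation relation can be sketched as follows. The geometry of Fig. 1(b) is not reproduced in this text, so the angle convention here is an assumption: both angles are measured from the baseline that joins the projector and the camera, and the returned range is the perpendicular distance from that baseline. The function name is hypothetical.

```python
import math


def triangulate_range(baseline_mm, alpha_p_deg, alpha_i_deg):
    """Perpendicular distance from the baseline to the illuminated point.

    Assumes both angles are measured from the baseline (an assumption,
    not taken from the source figure).
    """
    tan_p = math.tan(math.radians(alpha_p_deg))
    tan_i = math.tan(math.radians(alpha_i_deg))
    return baseline_mm * tan_p * tan_i / (tan_p + tan_i)


# Symmetric 45-degree rays over a 180-mm baseline meet 90 mm from it.
example_range_mm = triangulate_range(180.0, 45.0, 45.0)
```

Because $\alpha_i$ is derived from the detected beam position, any error in that position propagates directly into the computed range, which is why sub-pixel resolution matters.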

In a multisampling method, all the pixel values are updated 
repeatedly during the photo integration. Pixels with stronger 
incident intensity are activated faster and found many times in 
multiple samplings as shown in Fig. 13. In the conventional 
single sampling mode, the acquired data are binary, and so 
the sub-pixel resolution of calculated center position is 0.5 
pixels as shown in Fig. 13(a). On the other hand, the number 
of samplings represents the scale of the intensity profile as 
shown in Fig. 13(b). Some scales provide a fine sub-pixel 
resolution of center position detection for range accuracy 
improvement. Fig. 14 shows a theoretical estimation of the 
sub-pixel resolution as a function of the number of samplings. 
A Gaussian distribution is assumed as the beam intensity profile.
The sub-pixel resolution is efficiently improved in 2-8 sam- 
plings. For example, a 4-sampling mode attains 0.2 sub-pixel 
resolution. 
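The effect of multisampling can be illustrated with a small behavioral model. It is a sketch under assumptions and does not reproduce Fig. 14: the pixel signal is assumed to grow linearly during integration, a pixel is active once its signal reaches a threshold, and the peak, threshold, and beam width values are hypothetical. As in the row-parallel processor, the left and right edge addresses of each usable sampling are accumulated and the count of usable samplings is kept.

```python
import math


def multisample_center(center, sigma, n_samples,
                       n_pixels=32, peak=4.0, threshold=1.0):
    """Estimate the beam center from accumulated edge addresses."""
    accumulated = 0   # running sum of left + right edge addresses
    usable = 0        # samplings that contained at least one active pixel
    for k in range(1, n_samples + 1):
        elapsed = k / n_samples   # fraction of the integration time so far
        active = [
            x for x in range(n_pixels)
            if peak * elapsed * math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
            >= threshold
        ]
        if not active:
            continue              # this sampling found no activated pixel
        accumulated += active[0] + active[-1]
        usable += 1
    if usable == 0:
        return None
    return accumulated / (2 * usable)


single = multisample_center(15.3, 2.0, 1)   # binary data: multiples of 0.5 pixel
multi = multisample_center(15.3, 2.0, 4)    # four samplings: finer steps
```

In this example the single-sampling estimate is 15.0 and the four-sampling estimate is about 15.17 for a true center of 15.3, which illustrates the trend the text describes without claiming the exact figures of the theoretical estimation.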
Fig. 14. Sub-pixel resolution as a function of the number of samplings.

Fig. 15. Die microphotograph and pixel layout.


TABLE I
CHIP SPECIFICATIONS

Process              1P5M 0.18-µm CMOS process
Die size             5.9 mm × 5.9 mm
Resolution           375 × 365 pixels
Pixel size           11.25 µm × 11.25 µm
Fill factor          22.8%
Pixel configuration  1 PN-junction PD, 24 FETs/pixel
Total FETs           3.74 M transistors


V. CHIP IMPLEMENTATION 


A 375 × 365 3-D range-finding image sensor using the
present row-parallel architecture has been designed and
fabricated in a 0.18-µm standard CMOS process with 1-poly-Si
5-metal layers. The die size is 5.9 mm × 5.9 mm. Fig. 15
shows a chip microphotograph and a pixel layout. The sensor
consists of a 375 × 365 pixel array, a column-parallel address
generator, and row-parallel processors with 18-bit registers and
output buffers. A row scanner and a column multiplexer are
also implemented to acquire a binary 2-D image for test. The
row-parallel operations are executed by an on-chip sensor
controller with a PLL module. The implementation requires 3.74
million transistors. The supply voltage is 1.8 V. The pixel size
is 11.25 µm × 11.25 µm with 22.8% fill factor. Each pixel consists
of a PN-junction photo diode and 24 transistors. The photo diode
is composed of n+-diffusion and p-substrate. It is split into
several rectangular slices to improve the sensitivity since the
present CMOS process has no option of silicide layer removal.
Table I shows the chip specifications.

Fig. 17. Cycle time of activated pixel search and data readout.


VI. MEASUREMENT RESULTS 
A. Frame Access Rate 


The row-parallel position detection is pipelined in three 
stages on the sensor as shown in Fig. 16. The first stage is 
the photocurrent integration for pixel activation. The second 
stage is the row-parallel operation of activated pixel search and 
address acquisition. The last stage is the data readout operation 
from output buffers. The photocurrent integration period is 
called the pixel activation time. It depends on the incident beam 
intensity and the sensitivity of a photo diode. That is, the pixel 
activation time can be controlled by the beam intensity. On the 
other hand, the access time is limited by a search operation with
address acquisition or a data readout operation. Therefore our
principal aim is to achieve a short access time for high-speed 
position detection. 

Fig. 17 shows a cycle time of each pipelined stage at a 
400-MHz operation. The worst case of search signal propa- 
gation takes 90 ns. So the search path refresh and the search 
operations for the left and right edges each require 90 ns. The 
row-parallel address acquisition takes less than 200 ns in the 
worst case. The worst case of address acquisition occurs when 
all the detected pixels are placed on the same column because 
the load capacitance of a column address generator becomes 
largest and limits the injection speed of the bit-streamed column 
address signals. The total cycle time of search and address ac- 
quisition is 670 ns. The limiting factor of the access time is 
the digital readout stage from output buffers, which requires 
2737.5 ns. Therefore, the search and address acquisition can be 
repeated four times in the data readout period while maintaining 
the frame access rate. 

We have tested the maximum access rate of the designed
sensor. The sensor allows user-specified pixel activation. The
worst-case situation is set by an electrical pattern on the sensor
plane. Fig. 18 shows measured waveforms of the worst-case
frame access to an electrical test pattern at 432 MHz. Fig. 19
shows a data readout circuit and the test equipment that was
used for probing the output signals. Output buffers in each row
are selected by $SEL_j$. The position results are read out by the
dynamic readout circuits, which are precharged by PRE, and
received by sense amplifiers that are synchronized with SACK.
The reference voltage $V_{ref}$ is set to 300 mV below the supply
voltage. The output signals are probed with two parasitic
capacitances of 7 and 13 pF.
All the activated pixels are set in the 374th column as the
worst-case situation. The expected results were successfully acquired
up to 432-MHz operation. The image sensor attains a frame
access rate of 394.5 kHz, which corresponds to 1052 range maps/s
with 375 × 365 range data. The data rate is 144 Mbit/pin/s at
the maximum frame access rate.

Fig. 18. Measured waveforms of the worst-case frame access to electrical test
pattern at 432 MHz.
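The timing figures quoted in this subsection can be cross-checked with a few lines of arithmetic. Two points in this sketch are inferences rather than statements from the source: the 670-ns total is decomposed as three 90-ns search phases plus two worst-case 200-ns address acquisitions, and one range map is taken to need 375 frames, which is the ratio of the quoted 394.5 kHz to 1052 range maps/s.

```python
SEARCH_NS = 90.0      # path refresh, left-edge search, right-edge search
ADDRESS_NS = 200.0    # worst-case address acquisition, one per edge (inferred)
READOUT_NS = 2737.5   # data readout stage at 400-MHz operation
FRAMES_PER_RANGE_MAP = 375   # inferred from 394.5 kHz / 1052 range maps/s

search_cycle_ns = 3 * SEARCH_NS + 2 * ADDRESS_NS
samplings_per_readout = int(READOUT_NS // search_cycle_ns)

# Readout length in clock cycles, assuming it scales with the clock.
readout_clocks = round(READOUT_NS * 400 / 1000)


def frame_rate_hz(clock_hz):
    """Frame access rate when the readout stage limits the pipeline."""
    return clock_hz / readout_clocks


frame_rate_khz = frame_rate_hz(432e6) / 1e3
range_maps_per_s = frame_rate_hz(432e6) / FRAMES_PER_RANGE_MAP
```

Under these assumptions the search cycle is 670 ns, four search cycles fit within one readout period, and a 432-MHz clock gives about 394.5 kHz and 1052 range maps/s, in agreement with the measured values.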


B. Range Accuracy 


Fig. 20 shows the measured range accuracy at a target
distance of around 600 mm. The X axis represents target distance
and the Y axis represents measured distance. Fig. 20(a) shows
the measured results in the conventional single sampling mode.
The maximum range error is 2.78 mm and the standard
deviation of error is 1.02 mm. The conventional single sampling mode
achieves 0.46% range accuracy with 0.5 sub-pixel resolution.
The range error is typically dominated by the pixel
quantization error of position detection on the focal plane. Therefore, the
range error can be suppressed by the multisampling technique
with four scales as shown in Fig. 20(b). The maximum range
error is 1.10 mm and the standard deviation is 0.47 mm in the
same situation. The multisampling mode attains 0.18% range
accuracy, which corresponds to around 0.2 sub-pixel resolution.

Fig. 20. Measured range accuracy. (a) Single-sampling mode. (b) Multisampling mode.
Fig. 22. Measurement result of range finding. (a) Measured range data.
(b) Target object.


The range accuracy suffers from fluctuation of the threshold 
voltage of pixel activation. The peak-to-peak threshold fluctu- 
ation is about 150 mV including the reset voltage drop on the 
sensor, which is calculated by binary 2-D images that are mea- 
sured using various reset voltages. However, the intensity profile 
with four scales does not fatally suffer from the fluctuation be- 
cause the fluctuation has strong correlation with the location on 
the sensor and it is small enough to still allow the calculation of 
the center position in a local area. The timing of pixel activation 
is separated from the search and address acquisition operations 
as shown in Fig. 8. That is, the pixel activation is executed after 
the search path refresh and before the search signal propaga- 
tion. Therefore, the pixel activation is not affected by crosstalk 
caused by digital signaling on the focal plane. 


C. Example of Measured Range Image 


Fig. 21 shows a photograph of the present measurement setup. 
The baseline between a camera and a beam projector is set to 
180 mm. The target distance is 600 mm and the target scene
is 90 × 90 mm². A 300-mW laser beam is expanded by a rod
lens as a sheet beam with 5 mm width. The beam wavelength 
is 635 nm. Fig. 22 shows an example of measured range im- 
ages. The measured 3-D data are plotted on three-dimensional 
coordinates as a wire-frame model (a) of a target object (b) in 
Fig. 22. In the present measurement setup, the limiting factor of 
the range finding is the pixel activation time. So the system
requires a higher sensitivity photo detector or a sharp and strong
laser beam. Our future work is to get better performance of the 
designed image sensor by satisfying these system requirements. 
Table II summarizes the chip performances. 
TABLE II
CHIP PERFORMANCE

Supply voltage        1.8 V
Max. clock freq.      432 MHz
Frame access rate     394.5 kHz
Data rate             144 Mbit/pin/s
Range finding speed   1052 range maps/s
Sub-pixel resolution  0.2 pixels (4 samplings)
Range accuracy        max. 1.10 mm @ 600 mm
                      S.D. 0.47 mm @ 600 mm
Power dissipation     1065 mW @ 432 MHz, 1.8 V





VII. CONCLUSION 


We have presented a high-speed 3-D image sensor for a 1000 
range maps/s 3-D measurement system which has many poten- 
tial applications such as shape measurement of structural de- 
formation and destruction, quick inspection of industrial com- 
ponents, observation of high-speed moving objects, and fast 
visual feedback systems in robot vision. A row-parallel frame 
access architecture has been proposed for the high-speed range 
finding. The row-parallel search operations are executed by a 
chained search circuit embedded in a pixel on the focal plane. 
The bit-streamed column address flow enables row-parallel ad- 
dress acquisition with a compact circuit implementation. More- 
over a multisampling technique is available for range accuracy 
improvement. A 375 × 365 3-D range-finding image sensor has
been designed and fabricated in a one-poly five-metal (1P5M)
0.18-µm standard CMOS process. It attains a high-speed frame
access rate with multiple samplings. The maximum frame ac- 
cess rate is 394.5 kHz with four samplings, which has a poten- 
tial capability of 1052 range maps/s in the case of a sufficiently 
strong beam intensity. Then it provides 1.10 mm range accuracy 
at a target distance of 600 mm. It has been improved up to 0.2 
sub-pixel resolution by the multisampling technique. 


A CMOS Smart Temperature Sensor With a 3σ
Inaccuracy of ±0.5 °C From −50 °C to 120 °C
Michiel A. P. Pertijs, Student Member, IEEE, Andrea Niederkorn, Xu Ma, Bill McKillop, Member, IEEE, 
Anton Bakker, Senior Member, IEEE, and Johan H. Huijsing, Fellow, IEEE 
Abstract—A low-cost temperature sensor with on-chip
sigma-delta ADC and digital bus interface was realized in a
0.5-µm CMOS process. Substrate pnp transistors are used for
temperature sensing and for generating the ADC’s reference
voltage. To obtain a high initial accuracy in the readout circuitry,
chopper amplifiers and dynamic element matching are used. High
linearity is obtained by using second-order curvature correction.
With these measures, the sensor’s temperature error is dominated
by spread on the base-emitter voltage of the pnp transistors. This
is trimmed after packaging by comparing the sensor’s output with
the die temperature measured using an extra on-chip calibration
transistor. Compared to traditional calibration techniques, this
procedure is much faster and therefore reduces production costs.
The sensor is accurate to within ±0.5 °C (3σ) from −50 °C to
120 °C.


Index Terms—Calibration, curvature correction, dynamic offset 
cancellation, smart sensors, temperature sensors. 


I. INTRODUCTION 


INTEGRATED temperature sensors with an on-chip
analog-to-digital converter and bus interface find growing
application in thermal management systems. These so-called
“smart” temperature sensors are widely applied in PCs and
laptops to monitor the temperature of the microprocessor, the
case, and power-consuming peripheral ICs. This application
requires low-cost temperature sensors with a desired inaccuracy
below ±1.0 °C [1].

Previous smart temperature sensors were usually calibrated
at one fixed temperature, at which their inaccuracy could be
trimmed below ±1.0 °C at the cost of a time-consuming (and
therefore expensive) calibration after packaging. Their
inaccuracy over the industrial temperature range is however larger than
±2 °C [2]-[8].

This paper describes in detail a smart temperature sensor 
which achieves an inaccuracy of ±0.5 °C (3σ) from −50 °C
to 120 °C [9]. Costs are kept low by using a mature 0.5-µm
CMOS process and a fast calibration procedure. After
packaging, the sensor is calibrated by measuring its die temperature
using an extra on-chip calibration transistor. Thus, the required
calibration time is greatly reduced compared to a traditional
calibration with an external reference thermometer. To obtain a 
high initial accuracy, dynamic offset cancellation and dynamic 
element matching are applied in the analog front-end. Good 
linearity over a wide temperature range is obtained by applying 
second-order curvature correction. 

This paper is organized as follows. Section II introduces the 
measurement principle, including the curvature correction tech- 
nique. In Section III, the analog front-end circuitry is discussed, 
which generates two temperature-dependent currents. These are 
input to a second-order sigma-delta ADC, which is described in 
Section IV. The calibration technique is detailed in Section V. 
The paper ends with experimental results in Section VI and 
conclusions. 


II. MEASUREMENT PRINCIPLE


To convert temperature to a digital value, both a well-de- 
fined temperature-dependent signal and a temperature-indepen- 
dent reference signal are required. Both can be derived from the 
base-emitter voltage of a bipolar transistor, in the form of the 
thermal voltage $kT/q$ and the silicon bandgap voltage [10]. In
a CMOS process, substrate pnp transistors are mostly used for
this purpose [11]. These are vertical bipolar transistors with a
p+ diffusion as emitter, an n-well as base, and the p− substrate
as collector.

Two voltages are of interest: the base-emitter voltage $V_{BE}$ of
a single transistor in its forward-active region, and the difference
$\Delta V_{BE}$ between the base-emitter voltages of two such transistors
biased at different collector current densities.


A. Temperature Dependence of $V_{BE}$


From the well-known exponential relation between the
collector current $I_C$ and the base-emitter voltage $V_{BE}$, the
following expression for $V_{BE}$ as a function of absolute temperature
$T$ can be derived [10]:

$$V_{BE}(T) = V_{g0}\left(1 - \frac{T}{T_r}\right) + \frac{T}{T_r} V_{BE}(T_r) - \eta \frac{kT}{q} \ln\left(\frac{T}{T_r}\right) + \frac{kT}{q} \ln\left(\frac{I_C(T)}{I_C(T_r)}\right) \tag{1}$$

where $V_{g0}$ is the extrapolated bandgap voltage at 0 K, $\eta$ is a
process-dependent constant, $k$ is Boltzmann’s constant, $q$ is the
electron charge, and $T_r$ is an arbitrary reference temperature.
As illustrated in Fig. 1(a), $V_{BE}(T)$ is an almost linear function
of temperature, with a typical slope of −2 mV/K. The
nonlinearity, or curvature, is represented by the last two terms of (1).
It depends on the constant $\eta$ and on the temperature dependence
of the collector current.

Fig. 1. (a) Temperature dependence of the base-emitter voltage $V_{BE}$. (b) Variation in $V_{BE}$ due to process spread (curvature omitted for clarity). (c) Combination
of $V_{BE}$ and $\Delta V_{BE}$ to yield the bandgap reference voltage $V_{REF}$ (curvature again omitted).

The slope of the base-emitter voltage depends on process pa- 
rameters and the absolute value of the collector current. Its ex- 
trapolated value at 0 K, however, is insensitive to process spread 
and current level, as illustrated in Fig. 1(b). Therefore, a
calibration at one temperature can be used to trim the slope of $V_{BE}$ to
a desired value [12].

$V_{BE}$ is also sensitive to stress. Fortunately, substrate pnp
transistors are much less stress-sensitive than other bipolar
transistors [13]. Packaging-induced shifts in $V_{BE}$ will be corrected by
calibrating the sensor after packaging, as will be discussed in
Section V.


B. Temperature Dependence of $\Delta V_{BE}$


The difference $\Delta V_{BE}$ between the base-emitter voltages of
a transistor operated at two collector currents $I_{C1}$ and $I_{C2}$ can be
expressed as [10]

$$\Delta V_{BE}(T) = V_{BE2}(T) - V_{BE1}(T) = \frac{kT}{q} \ln\left(\frac{I_{C2}}{I_{C1}}\right) \tag{2}$$

Provided the collector-current ratio is constant, $\Delta V_{BE}$ is
proportional to absolute temperature (PTAT), as shown in Fig. 1(c).

In contrast with $V_{BE}$, $\Delta V_{BE}$ is independent of process
parameters and the absolute value of the collector currents.¹
Moreover, it is insensitive to stress [15]. Its temperature coefficient
is, however, typically an order of magnitude smaller than that of
$V_{BE}$ (depending on the collector current ratio).
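A quick numerical check of (2) shows why the temperature coefficient of $\Delta V_{BE}$ is small compared with the roughly −2 mV/K slope of $V_{BE}$. The current ratio of 8 used here is a hypothetical value chosen for illustration; the source does not state the ratio at this point.

```python
import math

K_OVER_Q = 8.617333e-5   # Boltzmann constant over electron charge, in V/K


def delta_vbe(temp_kelvin, current_ratio):
    """PTAT voltage from (2): (kT/q) * ln(IC2 / IC1)."""
    return K_OVER_Q * temp_kelvin * math.log(current_ratio)


ratio = 8                                        # hypothetical current ratio
dvbe_300k = delta_vbe(300.0, ratio)              # about 54 mV
slope_mv_per_k = 1e3 * K_OVER_Q * math.log(ratio)  # about 0.18 mV/K
```

With this ratio the slope is about 0.18 mV/K, roughly a tenth of the magnitude of the $V_{BE}$ slope, which is consistent with the order-of-magnitude statement above.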


C. Combining $V_{BE}$ and $\Delta V_{BE}$


In a bandgap voltage reference, an amplified version of
$\Delta V_{BE}$ is added to $V_{BE}$ to yield a temperature-independent
reference voltage $V_{REF}$, as illustrated in Fig. 1(c). In our
temperature sensor, this addition is implemented in the current
domain at the input of the sigma-delta modulator (Fig. 2).
Depending on the bitstream output $bs$ of the modulator, either
a current $\Delta V_{BE}/R_1$ is integrated (when $bs = 0$) or a current
$-V_{BEtrim}/R_2$ (when $bs = 1$), where $V_{BEtrim}$ is a trimmed
base-emitter voltage. The negative feedback in the modulator
will ensure that the average current flowing into the integrator
is zero. This implies

$$\mu \frac{V_{BEtrim}}{R_2} = (1 - \mu) \frac{\Delta V_{BE}}{R_1} \;\Rightarrow\; \mu = \frac{\alpha \Delta V_{BE}}{V_{BEtrim} + \alpha \Delta V_{BE}} = \frac{\alpha \Delta V_{BE}}{V_{REF}} \tag{3}$$

where $\mu$ is the average value of the bitstream (i.e., the fraction
of 1’s), and $\alpha = R_2/R_1$. The denominator of (3) is essentially
a bandgap reference voltage, while the numerator is PTAT. The
average $\mu$ will therefore also be PTAT, so that the bitstream can
be used, with appropriate scaling in the digital decimation filter,
to produce a digital representation of the chip’s temperature in
degrees Celsius.

Fig. 2. Simplified circuit diagram of the sigma-delta modulator.

¹Often a multiplicative factor n is included in the equation for $\Delta V_{BE}$ to
model the influence of the reverse Early effect and other nonidealities [14]. If
$V_{BE}$ and $\Delta V_{BE}$ are generated using transistors biased at approximately the
same current density, an equal multiplicative factor will appear in $V_{BE}$. In a
smart temperature sensor, these factors cancel, and will therefore not be
considered further.

With the configuration of Fig. 2, only about 30% of the dynamic 
range of the sigma-delta modulator is used, since μ = 0 
corresponds to −273 °C and μ = 1 corresponds to approximately 
325 °C, while the temperature range of interest is from 
−50 °C to 125 °C. Other combinations of V_BEtrim and ΔV_BE 
can be used to utilize more of the dynamic range [16], but these 
require copying or scaling of the currents, thus introducing more 
sources of error. Since a second-order sigma-delta modulator 
is used, which can easily provide sufficient resolution, a more 
efficient use of the dynamic range is not needed. In fact, for the 
single-loop modulator used (Section IV), the quantization noise 
increases strongly for μ close to 0 or 1. With the configuration 
of Fig. 2, these regions are conveniently avoided. 
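
The 30% figure follows from the two endpoints quoted above. The sketch below assumes a purely linear relation between μ and temperature (curvature is ignored) and uses the approximate endpoint of 325 °C; the function names are illustrative only.

```python
MU0_TEMP_C = -273.0  # temperature corresponding to mu = 0 (from the text)
MU1_TEMP_C = 325.0   # approximate temperature corresponding to mu = 1 (from the text)


def mu_to_celsius(mu):
    """Linear map from bitstream average to temperature (curvature ignored)."""
    return MU0_TEMP_C + mu * (MU1_TEMP_C - MU0_TEMP_C)


def celsius_to_mu(temp_c):
    """Inverse of mu_to_celsius."""
    return (temp_c - MU0_TEMP_C) / (MU1_TEMP_C - MU0_TEMP_C)


mu_low = celsius_to_mu(-50.0)
mu_high = celsius_to_mu(125.0)
used_fraction = mu_high - mu_low  # share of the modulator's input range in use
print(mu_low, mu_high, used_fraction)
```

Under these assumptions the range of interest occupies the middle of the modulator's input range, well away from μ = 0 and μ = 1.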


D. Curvature Correction 


The curvature of V_BE will also be present in the reference 
voltage V_REF, which, in turn, results in a nonlinearity in μ(T). 
The curvature is modeled by the last two terms in (1). For a value 
of η = 4.4 for our process and a PTAT collector current (as used 
in our design), the corresponding nonlinearity amounts to 2 °C 
over the temperature range of −50 °C to 125 °C. 

Fig. 4. Simplified circuit diagram of the ΔV_BE-dependent current source. 

Fortunately, the second-order component of the curvature can 
easily be eliminated by giving V_REF a small positive temperature 
coefficient [4], [17], i.e., by making α in (3) slightly larger 
than in a bandgap reference. With an appropriate value for α (22 
in our case), such a temperature-dependent V_REF gives rise to 
a second-order nonlinearity in μ(T) which exactly cancels the 
second-order nonlinearity originating from V_BE. What remains 
is a third-order nonlinearity of about 0.3 °C over the temperature 
range. 


E. Block Diagram 


The input currents for the sigma-delta modulator of Fig. 2 are 
generated by a ΔV_BE/R_1 current source and a V_BEtrim/R_2 
current source, as shown in the block diagram of Fig. 3. A decimation 
filter converts the bitstream output of the modulator to a 
digital representation of the temperature, also taking care of the 
scaling required to convert the average value μ of the bitstream 
to °C. The result is communicated to the outside world using an 
I²C bus interface. Also on the chip are the calibration transistor, 
a PROM to hold the setting of the trimming of V_BE, a biasing 
circuit, and an oscillator. 


III. TEMPERATURE-DEPENDENT CURRENT SOURCES 
A. ΔV_BE-Dependent Current Source 


A simplified circuit diagram of the ΔV_BE/R_1 current source 
is shown in Fig. 4 [16]. Two substrate pnp transistors Q_1 and Q_2 
are biased at a 3:1 current ratio. The bias currents are generated 
in a separate circuit (not shown). The resulting difference in 
base-emitter voltage ΔV_BE has a sensitivity of 100 µV/°C. By 
means of the feedback loop, ΔV_BE is generated across a resistor 
R_1 in series with the base of Q_2, resulting in the desired output 
current. To prevent the base current of Q_2 from affecting the 
output current, a resistor R_1/3 is added in series with the base 
of Q_1. As the base current of Q_1 is three times as large as that 
of Q_2, the base currents result in an equal voltage drop across 
both resistors, which is a small common-mode change that does 
not affect the output current. 

Fig. 5. Principle of a nested-chopper amplifier. 

The inaccuracy of the circuit of Fig. 4 is mainly determined 
by the offset V_os of the opamp, which directly adds to ΔV_BE. 
To result in a negligible temperature error (0.1 °C), this offset 
has to be smaller than 10 µV. Since typical offsets of CMOS 
opamps are in the millivolt range, offset cancellation is required. 
Mismatch in the current sources or the pnp transistors 
also leads to temperature errors. For these errors to be negligible, 
the matching has to be better than 0.035%, which requires 
dynamic element matching. 
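
The offset and matching requirements can be reproduced with a short calculation. This sketch assumes the ideal relation ΔV_BE = (kT/q)·ln(r) with r = 3 and no nonideality factor, and evaluates the mismatch error at an assumed temperature of 300 K; the function names are illustrative.

```python
import math

K_OVER_Q = 8.617e-5  # Boltzmann constant over electron charge, V/K


def dvbe_sensitivity(ratio):
    """Temperature sensitivity of dV_BE in V/K for an ideal current ratio."""
    return K_OVER_Q * math.log(ratio)


def offset_error_kelvin(v_offset, ratio):
    """Temperature error caused by an offset voltage added directly to dV_BE."""
    return v_offset / dvbe_sensitivity(ratio)


def ratio_mismatch_error_kelvin(rel_mismatch, ratio, temp_k):
    """Temperature error caused by a small relative error in the current ratio.

    A relative error e in the ratio r changes ln(r) by approximately e,
    so the apparent temperature changes by about T * e / ln(r).
    """
    return temp_k * rel_mismatch / math.log(ratio)


sens = dvbe_sensitivity(3.0)                                # close to 100 uV/K
err_offset = offset_error_kelvin(10e-6, 3.0)                # 10 uV offset
err_match = ratio_mismatch_error_kelvin(0.00035, 3.0, 300.0)  # 0.035 % mismatch
print(sens, err_offset, err_match)
```

With these assumptions, both a 10-µV offset and a 0.035% ratio error correspond to an error of roughly 0.1 °C, in line with the figures quoted above.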

The offset of the opamp can be reduced using the chopping 
technique. In a regular chopper amplifier, a pair of chopper 
switches is added around the amplifier whose offset V_os needs 
to be cancelled (Fig. 5) [16]. The chopper at the input modulates 
the input signal to the frequency of control signal φ_H, which 
lies above the offset and 1/f corner frequency of the amplifier. 
The chopper at the output demodulates the amplified input 
signal, and simultaneously modulates the amplified offset and 
1/f noise to the frequency of φ_H, where they can be filtered 
out by a low-pass filter (LPF). 

Due to charge injection and clock feedthrough, a regular 
chopper amplifier has a typical residual offset of a few tens 
of microvolts. To reduce the offset below 10 µV, an extra 
outer pair of chopper switches is added. This is controlled by 
a low-frequency control signal φ_L. This pair modulates the 
regular chopper amplifier’s residual offset to the frequency of 
φ_L, where it can also be removed by the LPF. The residual 
offset of the resulting nested-chopper amplifier is determined 
by clock feedthrough and charge injection in the low-frequency 
chopper switches, and is therefore much smaller than that of the 
regular chopper amplifier. Residual offsets as low as 100 nV 
have been reported [18]. 
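
The principle of chopping can be illustrated with a sampled, idealized model. In this sketch the low-pass filter is replaced by a plain average over a whole number of chopping periods, the switches are ideal, and all numerical values (gain, offsets, sample counts) are hypothetical. The same function also models the outer loop of a nested chopper if the inner chopper amplifier is treated as an amplifier with a small residual offset.

```python
def square_wave(n, half_period):
    """Return n samples of a +1/-1 square wave toggling every half_period samples."""
    return [1 if (i // half_period) % 2 == 0 else -1 for i in range(n)]


def chopped_average(v_in, v_offset, gain, n, half_period):
    """Average output of an idealized chopper amplifier over n samples."""
    total = 0.0
    for c in square_wave(n, half_period):
        modulated = c * v_in                        # input chopper
        amplified = gain * (modulated + v_offset)   # amplifier adds its offset
        total += c * amplified                      # output chopper demodulates
    return total / n                                # averaging stands in for the LPF


gain = 100.0
v_in = 1e-3

# Without chopping, a 5-mV offset appears at the output together with the signal.
unchopped = gain * (v_in + 5e-3)

# Regular chopper: the offset is modulated to the chopping frequency and averages out.
regular = chopped_average(v_in, 5e-3, gain, n=400, half_period=1)

# Outer loop of a nested chopper: a 20-uV residual offset of the inner chopper
# amplifier is chopped at a much lower frequency (hypothetical ratio).
nested = chopped_average(v_in, 20e-6, gain, n=400, half_period=100)
print(unchopped, regular, nested)
```

In this idealized model the average output equals the amplified input in both chopped cases; the residual offsets discussed above arise from switch nonidealities that the model does not include.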

Fig. 6 shows how the nested-chopper amplifier is embedded 
in the ΔV_BE/R_1 current source. The opamp is split up into three 
stages, with chopper switches between them. The first stage is a 
folded-cascode amplifier, the second stage is a differential pair, 
and the third stage is its current mirror load. Miller compensation 
(not shown) is used to stabilize the opamp. 

Fig. 6. Detailed circuit diagram of the ΔV_BE-dependent current source. 


The input chopper driven by φ_H is implemented in the current 
domain, by switching between a 3:1 and 1:3 current ratio. 
Thus, offset resulting from mismatch between the pnp transistors 
is also chopped. To maintain the correct feedback polarity, 
the connection to the output transistor is switched back and forth 
between the bases of Q_1 and Q_2. As in Fig. 4, compensation for 
the base currents is realized by making sure that a resistor R_1/3 
is in series with the base of the transistor that carries the larger 
bias current. 

The bias currents are generated by four current sources of 
0.5 µA each, which are dynamically matched using the control 
signals φ_H and φ_L. Alternately, one of the current sources biases 
one transistor, while the remaining three bias the other. 

The control signal φ_H switches at 16 kHz, while φ_L switches 
at 80 Hz. The modulated offset and 1/f noise components are 
filtered out by the sigma-delta modulator and the decimation 
filter, as will be discussed in Section IV. 


B. V_BE-Dependent Current Source 


The trimmed base-emitter voltage V_BEtrim is generated by 
adjusting the base-emitter voltage V_BE of a substrate pnp transistor 
with a small programmable PTAT voltage. Fig. 7 shows 
how this is implemented: a PTAT current is passed through a 
digitally programmable resistor in series with a diode-connected 
substrate pnp. The PTAT voltage across this resistor compensates 
for the PTAT-type spread on V_BE [Fig. 1(b)]. The PTAT 
current in Fig. 7 is generated in a separate bias circuit (not 
shown). 

The current V_BEtrim/R_2 is generated using a voltage-to-current 
converter around a regular chopper amplifier controlled by 
φ_H. Because of the higher sensitivity of V_BE (−2 mV/°C), a 
nested-chopper amplifier was not needed here. The amplifier has 
a folded-cascode topology. To accurately define the ratio α in 
(3), the resistors R_1 and R_2 are made of identical unit resistors. 

To save power, the nominal output current is kept relatively 
small (0.5 µA). Therefore, a large resistance (more than 1 MΩ) 
is required. In order to reduce the size of the resistor, a current 
mirror with a dynamically matched 3:1 ratio is used. The 
dynamic element matching is again controlled by φ_L and φ_H. 
Thus, the chip area required for the resistor is reduced by a factor 
of 3 without using special high-resistivity resistors (which would 
require extra processing steps). 
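
The resistor values implied by this paragraph can be estimated as follows. This is a rough sketch: the base-emitter voltage of 0.6 V is an assumed typical value, and the interpretation that the resistor carries three times the output current, which the 3:1 mirror then divides down, is an assumption about how the mirror is used.

```python
V_BE_TRIM = 0.6    # V, assumed typical trimmed base-emitter voltage
I_OUT = 0.5e-6     # A, nominal output current (from the text)
MIRROR_RATIO = 3   # 3:1 current mirror (from the text)

# Resistor needed if it carried the output current directly.
r_direct = V_BE_TRIM / I_OUT

# Resistor needed if it carries three times the output current,
# which the mirror then scales down by the mirror ratio.
r_mirrored = V_BE_TRIM / (MIRROR_RATIO * I_OUT)

print(r_direct, r_mirrored, r_direct / r_mirrored)
```

Under these assumptions the direct approach would need more than 1 MΩ, while the mirrored approach needs one third of that resistance.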
Fig. 7. Circuit diagram of the V_BE-dependent current source. 

Fig. 8. Circuit diagram of the sigma-delta modulator; initialization circuits are omitted for clarity; unused currents are switched to V_1. 


IV. SIGMA-DELTA ADC 


A sigma-delta ADC is used to convert the temperature-de- 
pendent currents into a digital temperature reading. A quanti- 
zation noise below 0.05 °C in a conversion time of 30 ms was 
desired. With a first-order sigma-delta modulator, as was used 
in previous work [4], [16], this would require a clock frequency 
of about 500 kHz. As this would lead to an undesirably high 
power consumption, a second-order modulator was used, which 
requires a clock frequency of only 16 kHz. 

As in an incremental ADC [19], the integrators of the modu- 
lator are reset at the beginning of the conversion, and a second- 
order decimation filter is used rather than the usual third-order 
filter. In contrast with an incremental ADC, however, the input 
signal is not sampled and held during the conversion, but it is 
integrated continuously so as to filter out the modulated offset 
and 1/f noise. 


A. Sigma-Delta Modulator 


A simplified circuit diagram of the sigma-delta modulator 
is shown in Fig. 8. It is clocked using a nonoverlapping clock 
which runs at the same frequency as the control signal φ_H in the 
current sources. This ensures that modulated offset at harmonics 
of φ_H is averaged out within a clock cycle of the modulator. As 
discussed in Section II-C, the bitstream determines which of the 
two input currents is integrated on the first integrator. Unused 
currents are dumped into a reference node at V_1 (not shown). 

During clock phase φ_1, the output of the first integrator is 
sampled on capacitor C_2. During phase φ_2, the charge is transferred 
to the second integrator, the output of which is fed into 
a clocked comparator that produces the bitstream bs. To minimize 
charge injection onto C_2, clock signals φ_1d and φ_2d have 
delayed downgoing edges with respect to φ_1 and φ_2. Scaled 
copies of the input currents are integrated on the second integrator 
during phase φ_1 to ensure stability of the modulator and 
to minimize the swing at the output of the first integrator. The 
scaled copies are not critical for the dc accuracy of the modulator; 
mismatches up to several percent can be tolerated. 

Fig. 9. (a) Initialization circuit for the second integrator. (b) Waveforms during the initialization sequence. 

The modulator is implemented using MOS capacitors, to 
avoid the extra processing steps required for linear capacitors. 
C_1 and C_2 are made from identical unit capacitors to ensure 
linear charge transfer in spite of the nonlinearity of these 
capacitors. The nonlinearity of C_3 is not relevant, since only 
the sign of the output of the second integrator is detected by the 
comparator. 

To maximize the capacitance per area of the MOS capacitors, 
and to avoid operating them in their most nonlinear region 
(around 0 V), they are biased in accumulation. The gates of the 
capacitors are at V_1, while the feedback ensures that the average 
voltage on their wells is V_2. Therefore, they can be biased in 
accumulation by choosing V_1 sufficiently higher than V_2 (1.2 V in 
this case). 


B. Initialization Sequence 


In contrast with the usual continuous operation of sigma-delta 
ADCs, the temperature sensor requires a “one-shot” type of op- 
eration, i.e., the converter is powered up, produces a single con- 
version result, and powers down again to save power. This has 
implications for both the initialization of the modulator, and the 
design of the decimation filter. 

After power-up, the modulator is brought into a well-defined 
state by resetting the integration capacitors. After the reset, 
the integration capacitors could be driven into accumulation 
by the feedback loop, but this may take many clock cycles 
(depending on the input signal). To expedite this, the capacitors 
are precharged using an initialization current I_init, as shown in 
Fig. 9(a) for the second integrator. The initialization current is 
switched to the input of the integrator until its output reaches 
the voltage V_2, which is detected by the comparator. 

To allow for similar initialization of the first integrator, its 
output can be connected to the input of the comparator using 
a set of switches (not shown). The total initialization sequence 
consists of resetting both integration capacitors, precharging the 
capacitor of the first integrator, and then precharging that of the 
second integrator. The corresponding waveforms are shown in 
Fig. 9(b). 


C. Decimation Filter 


Once the modulator has reached its steady state, the bitstream 
is fed into a decimation filter, which produces a single conversion 
result. Usually, the order of a sinc decimation filter is 
chosen one higher than that of the loop filter [20], which implies 
a third-order filter for our second-order modulator. However, 
for a given conversion time, and thus a given impulse response 
length of the filter, the corner frequency of a third-order 
filter is higher than that of a second-order filter. Due to this 
higher corner frequency, the use of a third-order filter will result 
in more quantization noise, in spite of its faster roll-off. Therefore, 
a less complex sinc² filter is used rather than a sinc³ filter. 

For the chopping and dynamic element matching in the current 
sources to be effective, the decimation filter has to filter out 
the residuals modulated by the low-frequency control signal φ_L. 
Therefore, φ_L is clocked at a frequency that coincides with the 
first zero in the frequency response of the sinc² filter, which is 
at approximately 80 Hz. 

The decimation filter is implemented by an up/down counter 
and an accumulator. The counter counts up during the first half 
of the decimation period and down during the second half, thus 
realizing the triangular impulse response of a sinc² filter. The 
accumulator adds the counter value if the bitstream is “1”. The 
initial value of the accumulator and the exact length of the decimation 
period (and thereby the gain of the filter) are chosen such 
that the accumulated value at the end of the conversion can be 
directly interpreted as a temperature in degrees Celsius. 
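
The counter-and-accumulator structure can be sketched in a few lines. The half-length of 200 clock cycles is a hypothetical choice, picked here only because it places the first zero of the filter near 80 Hz at a 16-kHz clock; the actual decimation length and the initial accumulator value of the design are not given in the text.

```python
def triangular_weights(half_length):
    """Weights 1..H followed by H..1: the impulse response realized by the counter."""
    return list(range(1, half_length + 1)) + list(range(half_length, 0, -1))


def decimate(bitstream, half_length, initial=0):
    """Up/down counter plus accumulator: add the counter value whenever the bit is 1."""
    counter = 0
    acc = initial
    for i, bit in enumerate(bitstream):
        if i < half_length:
            counter += 1        # count up during the first half
        elif i > half_length:
            counter -= 1        # count down during the second half
        if bit:
            acc += counter
    return acc


HALF = 200                      # hypothetical half-length in clock cycles
F_CLK = 16000.0                 # modulator clock in Hz (from the text)
full_scale = sum(triangular_weights(HALF))

bits = [0, 1] * HALF            # test bitstream with an average of 0.5
mu_estimate = decimate(bits, HALF) / full_scale
approx_first_zero_hz = F_CLK / HALF
print(mu_estimate, approx_first_zero_hz)
```

Dividing the accumulated value by the sum of the weights gives a weighted average of the bitstream; in the actual design this scaling and the offset are absorbed into the filter length and the initial accumulator value so that the result reads directly in degrees Celsius.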


V. CALIBRATION TECHNIQUE 


Fig. 10. Connection of the calibration transistor by reusing digital input pins. 

To calibrate any integrated temperature sensor, its temperature 
reading has to be compared to that of a reference thermometer 
at the same temperature as the sensor chip. The difference 
between the readings may then be used to trim the sensor. 
This calibration is often done at wafer-level, which has the 
advantage that the temperature of the whole wafer can be stabilized 
and measured, after which the individual sensors can be calibrated 
and trimmed using a wafer prober. An important disadvantage 
of this approach, however, is that additional errors introduced 
by packaging stress are not taken into account. Even when 
the sensor design is based on relatively stress-insensitive substrate 
pnp transistors, a significant error will result if a low-cost 
plastic package is used. Experiments on a bandgap reference 
based on such pnps have shown shifts up to 2 mV in V_BE [13]. 
As can be derived from (3), this translates to a temperature error 
of about 0.5 °C. Therefore, it is desired to do the calibration after 
packaging. 
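
The 0.5 °C figure can be checked against (3) with a finite-difference estimate. The sketch below assumes illustrative values (V_BEtrim = 0.6 V, an ideal ΔV_BE at 300 K for a 3:1 current ratio, α = 22) and uses the linear map between μ and temperature implied by the endpoints quoted in Section II-C; absolute readings from this toy model are not meaningful, only the difference is.

```python
import math

K_OVER_Q = 8.617e-5  # Boltzmann constant over electron charge, V/K


def mu_from_eq3(dv_be, v_be_trim, alpha):
    """Bitstream average according to (3)."""
    return alpha * dv_be / (v_be_trim + alpha * dv_be)


def reading_celsius(mu):
    """Linear scaling using mu = 0 at -273 C and mu = 1 at about 325 C."""
    return -273.0 + 598.0 * mu


dv_be = K_OVER_Q * 300.0 * math.log(3.0)   # ideal dV_BE at 300 K (assumed)
nominal = reading_celsius(mu_from_eq3(dv_be, 0.600, 22.0))
shifted = reading_celsius(mu_from_eq3(dv_be, 0.602, 22.0))  # V_BE shifted by 2 mV
stress_error = nominal - shifted
print(stress_error)
```

With these assumed values the 2-mV shift changes the reading by roughly half a degree, consistent with the estimate in the text.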

If the temperature of each individual packaged sensor has to 
be stabilized and measured with an inaccuracy below ±0.5 °C 
using a reference thermometer, this becomes the dominant 
contributor to the test time of the sensor. A faster and therefore 
cheaper alternative is to make use of the process- and stress-insensitivity 
of ΔV_BE (discussed in Section II-B): an extra substrate 
pnp transistor, the calibration transistor, has been integrated 
on the sensor chip, and is used as a reference thermometer 
inside the package [21]. From its ΔV_BE, measured using external 
electronics, the die temperature can be determined within 
±0.1 °C [21]. As it is integrated on the same thermally conducting 
silicon as the sensor circuit, very little thermal settling 
time is required. Moreover, the requirements on the thermal 
stability of the production setup are relaxed. 

Fig. 10 shows how the calibration transistor (Q_CAL) is connected 
without reserving extra pins for it. Two existing address 
pins of the I²C bus interface are reused during calibration to 
connect to the base and emitter of Q_CAL. During normal operation, 
Q_CAL is isolated from these pins using MOS switches. 
These switches are controlled via the bus interface. 

The temperature of the on-chip calibration transistor is determined 
by applying a number of bias currents to it, and measuring 
its base-emitter voltage and base current using external 
electronics. Thus, ΔV_BE can be measured while compensating 
for series resistances [22] and current-gain variations. From this, 
the chip temperature can be calculated with an absolute accuracy 
of ±0.1 °C [21]. The difference between this temperature and a 
reading of the sensor is then used to determine the appropriate 
setting for the programmable resistor R_trim in Fig. 7. This setting 
is then programmed in PROM via the I²C bus interface. 
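
One way to extract temperature from base-emitter voltages measured at several bias currents, while removing the effect of a series resistance, is to solve a small linear system. The sketch below is a generic three-current approach and is not necessarily the procedure of [21] or [22]; it assumes an ideal diode law plus a fixed series resistance, ignores current-gain variations, and uses hypothetical bias currents and a hypothetical saturation current to generate test data.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant over electron charge, V/K


def synth_vbe(current, temp_k, r_series, i_sat=1e-16):
    """Synthetic measurement: ideal diode law plus a series-resistance drop."""
    return K_OVER_Q * temp_k * math.log(current / i_sat) + current * r_series


def temperature_from_three_currents(i, v):
    """Solve two dV_BE equations for kT/q, eliminating the series resistance.

    Each difference obeys dV = (kT/q) * ln(ratio) + R_s * (current difference).
    """
    a1, a2 = math.log(i[1] / i[0]), math.log(i[2] / i[1])
    b1, b2 = i[1] - i[0], i[2] - i[1]
    d1, d2 = v[1] - v[0], v[2] - v[1]
    kt_over_q = (d1 * b2 - d2 * b1) / (a1 * b2 - a2 * b1)
    return kt_over_q / K_OVER_Q


currents = (5e-6, 10e-6, 40e-6)                       # hypothetical bias currents
voltages = [synth_vbe(c, 300.0, 50.0) for c in currents]

t_compensated = temperature_from_three_currents(currents, voltages)
# Naive two-current estimate that ignores the series resistance.
t_naive = (voltages[1] - voltages[0]) / (K_OVER_Q * math.log(2.0))
print(t_compensated, t_naive)
```

In this synthetic example a 50-Ω series resistance shifts the naive two-current estimate by several kelvin, while the three-current solution recovers the temperature used to generate the data.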


VI. EXPERIMENTAL RESULTS 


The temperature sensor was fabricated in a standard 0.5-µm 
CMOS process. A chip micrograph is shown in Fig. 11. The chip 
area is 2.5 mm², of which about half is used for the digital bus 
interface and control. 

Fig. 11. Chip micrograph. 

TABLE I 
PERFORMANCE SUMMARY 

Technology | 0.5-µm CMOS 
Chip size | 2.5 mm² 
Supply voltage | 2.7 V to 5.5 V 
Temperature range | −50 °C to 125 °C 
Conversion rate | 0.125 to 30 conversions/s 
Noise level | 0.03 °C rms 
Supply current | 130 µA at 10 conversions/s 
Power supply rejection | 0.3 °C/V from 3.0 V to 3.6 V 
Inaccuracy (3σ) | ±0.3 °C at 25 °C; ±0.5 °C from −50 °C to 120 °C 


Fig. 12 shows the measured temperature error of 32 samples 
from one processing batch, operated at a supply voltage of 3.3 V. 
These samples were packaged in 8-pin ceramic packages. They 
were calibrated and trimmed at room temperature using the 
described procedure, after which they were placed in an oven along 
with a platinum resistor calibrated to 20 mK. Their 3σ inaccuracy 
in the temperature range of −50 °C to 120 °C is ±0.5 °C. 
The performance of the chips is summarized in Table I. 

TABLE II 
COMPARISON OF INACCURACY WITH PREVIOUS WORK 

Reference | Inaccuracy | Range | Conditions | Calibration 
Bakker, 1996 [2] | [illegible] | −40 °C to 120 °C | min/max of 3 samples | after packaging, 2 points 
Tuthill, 1998 [3] | [illegible] | −50 °C to 125 °C | min/max of 6 samples | wafer-level, 1 point 
Pertijs, 2001 [4] | [illegible] | −50 °C to 125 °C | ±3σ of 32 samples | batch-calibration 
LM92 [5] | ±0.33 °C | 30 °C | min/max | unknown 
LM92 [5] | [illegible] | −25 °C to 150 °C | min/max | unknown 
DS1626 [6], ADT7301 [7] | ±0.5 °C | 0 °C to 70 °C | min/max | unknown 
DS1626 [6], ADT7301 [7] | ±2.0 °C | −55 °C to 125 °C | min/max | unknown 
SMT160-30 [8] | [illegible] | −30 °C to 100 °C | min/max | wafer-level, 1 point 
SMT160-30 [8] | [illegible] | −45 °C to 130 °C | min/max | wafer-level, 1 point 
This work | ±0.3 °C | 25 °C | ±3σ of 32 samples | after packaging, 1 point 
This work | ±0.5 °C | −50 °C to 125 °C | ±3σ of 32 samples | after packaging, 1 point 

Table II compares the inaccuracy with that of previous work. 
Though many smart temperature sensors have been published, 
only a few publications provide sufficient measurement results 
for a proper comparison [2]-[4]. Since most work in this field 
is done in industry, the inaccuracy specifications of four leading 
commercial temperature sensors have also been included in the 
table [5]-[8]. At room temperature, the presented sensor per- 
forms as well as the best-performing previous work, while over 
a wide temperature range it performs significantly better. 


VII. CONCLUSION 


A CMOS temperature sensor with integrated second-order 
sigma-delta ADC and bus interface has been presented. A high 
initial accuracy is achieved by applying dynamic offset cancel- 
lation and dynamic element matching in the front-end circuitry, 
and by applying a linearization technique that eliminates the 
second-order curvature. With these measures, the spread on the 
base-emitter voltage is the dominant source of errors. This is 
trimmed based on the results of a single-point calibration, which 
takes place after packaging. The chip temperature is determined 
from the electrical characteristics of an additional on-chip tran- 
sistor, which are measured using external electronics. Thus a 
fast and accurate calibration can be performed. After calibration 
at room temperature and trimming, the sensor has a 3σ inaccuracy 
of ±0.5 °C in the temperature range of −50 °C to 120 °C, 
which is, to date, the highest reported accuracy for this type of 
sensor. 

A Four-Channel 3.125-Gb/s/ch CMOS Serial-Link 
Transceiver With a Mixed-Mode Adaptive Equalizer 


Jinwook Kim, Jeongsik Yang, Sangjin Byun, Hyunduk Jun, Jeongkyu Park, Cormac S. G. Conroy, Member, IEEE, 
and Beomsup Kim, Fellow, IEEE 


Abstract—This paper presents a quad-channel serial-link 
transceiver providing a maximum full duplex raw data rate of 
12.5 Gb/s for a single 10-Gbit eXtended Attachment Unit Interface 
(XAUI) in a standard 0.18-µm CMOS technology. To achieve low 
bit-error rate (BER) and high-speed operation, a mixed-mode 
least-mean-square (LMS) adaptive equalizer and a low-jitter 
delay-immune clock data recovery (CDR) circuit are used. The 
transceiver achieves a BER lower than 4.5 × 10^−15 while its 
transmitted data and recovered clock have a low jitter of 46 and 
64 ps peak-to-peak, respectively. The chip consumes 178 mW 
per channel at a 3.125-Gb/s/ch full duplex (TX/RX simultaneous) 
data rate from a 1.8-V power supply. 


Index Terms—Adaptive equalizer, clock data recovery (CDR), 
serial-link transceiver. 


I. INTRODUCTION 


IN MODERN electrical interconnect systems, high-speed serial 
links have replaced parallel data buses, and serial link 
speed is rapidly increasing due to the evolution of CMOS technology. 
For example, high-end routers and backbone switches 
have wide parallel buses to communicate with network terminals 
such as network processors. High pin counts result in high-cost 
processors and switches, and make system engineering and 
board design difficult because of coupling and skew between 
bus lines. High-speed serial links eliminate these problems. 
Serial link performance is limited by: 1) noise, which introduces 
timing and amplitude errors, and 2) the bandwidth limitations 
of the electronic components. In order to resolve the 
inter-symbol interference (ISI) problems caused by bandwidth 
limitations, pre-emphasis techniques are used on the transmitter 
side [2], and adaptive equalization is used on the receiver side [3]. 
Pre-emphasis works well when the channel characteristics are known, 
but it cannot adapt to channel variations. Moreover, the larger 
voltage swing caused by pre-emphasis generates more ringing. 
An adaptive equalizer compensates for channel distortion caused 
by limited bandwidth. Analog implementations have the advantage 
of filtering speed over digital implementations. Furthermore, 
even though analog approaches suffer from the nonidealities 
of analog components and noise, they have the advantage that, 
because the filtering occurs before sampling, they avoid the 
signal processing delays (i.e., latency) due to digital filtering, 
which affect the performance and stability of the 
phase-locked loop (PLL) that provides the sampling clock [8]. 
Two kinds of analog implementation have been used. One is a 
sampling-type equalizer [5] and the other is a continuous-time 
equalizer [7]. With the sampling-type equalizer, the sample-and-hold 
circuits become unstable as the data rate increases. Recent 
research has introduced a post-equalizer [3] at several Gb/s 
rates, but without any adaptation algorithm. This paper proposes 
a mixed-mode adaptive equalizer that takes advantage of both 
high-speed analog continuous-time filtering and the stability of 
digital tap adaptation. 
One key building block of an analog continuous-time 
transversal equalizer is an analog delay line. In order to meet 
the required delay of one bit period, programmability and 
tuning circuits are normally necessary. This paper introduces 
an analog delay line that generates an exact 1-bit delay without 
any tuning circuits. 

The clock data recovery (CDR) circuit plays a critical role 
in the receiver. It extracts the clock and regenerates data from 
the input data stream and reduces the timing error, one of the 
critical system performance limiting factors. In low-frequency 
applications a digital PLL can be used for good jitter suppres- 
sion or jitter tolerance [9]. Phase-tracking CDRs have been used 
for several Gb/s rates [14], [15] because they do not suffer from 
phase quantization errors. Comparing the two kinds of phase de- 
tection methods, the binary CDR is more suitable for high-speed 
operation than the linear CDR because it does not suffer from 
the timing offset caused by setup/hold-timing uncertainty of the 
sampler [16]. 

The jitter of a binary CDR circuit is set by the minimum resolution 
of the phase interpolator because of its bang-bang operation [6]. 
In the case of an ideal CDR circuit with no delay, 
which immediately updates the timing, the recovered clock jitter 
is limited by the minimum resolution of the phase interpolator. 
If there are delays in the recovery loop, the jitter exceeds 
the minimum resolution because these delays 
prevent an immediate timing update. In this paper, we present 
a new delay-immune CDR circuit. By ignoring the successive 
Up/Dn values generated during the delay of the recovery loop, it can 
implement an ideal bang-bang operation and reduce the jitter of 
the recovered clock. 
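
The effect of loop delay on a bang-bang loop, and the benefit of ignoring decisions made during that delay, can be illustrated with a toy model. This sketch is only one reading of the idea described above and not the circuit of this paper: phase is counted in interpolator steps, the detector is noiseless, the sampling target sits halfway between two steps, and the loop delay is a whole number of cycles.

```python
def peak_to_peak_jitter(loop_delay, ignore_during_delay, n_cycles=400, target=0.5):
    """Peak-to-peak phase wander (in interpolator steps) of a toy bang-bang loop."""
    phase = 0       # current interpolator setting, in steps
    history = []    # phase seen by the detector at each cycle
    hold = 0        # cycles left during which decisions are ignored
    trace = []
    for _ in range(n_cycles):
        history.append(phase)
        # The decision applied now was formed loop_delay cycles ago.
        observed = history[max(0, len(history) - 1 - loop_delay)]
        decision = 1 if observed < target else -1
        if ignore_during_delay and hold > 0:
            hold -= 1               # skip decisions formed before the last update
        else:
            phase += decision
            hold = loop_delay
        trace.append(phase)
    settled = trace[n_cycles // 2:]
    return max(settled) - min(settled)


ideal = peak_to_peak_jitter(0, False)      # no loop delay
delayed = peak_to_peak_jitter(2, False)    # two cycles of loop delay
immune = peak_to_peak_jitter(2, True)      # same delay, stale decisions ignored
print(ideal, delayed, immune)
```

In this model the delayed loop overshoots in both directions before stale decisions reverse it, while the loop that discards stale decisions dithers by a single step, as the undelayed loop does. The price in this simplified model is a lower update rate.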

This paper is organized as follows. The structure of the 
proposed transceiver architecture is presented in Section II. 
Section III explains the circuit implementation of each sub-block. 
Finally, experimental results are given in Section IV, and 
conclusions are presented in Section V. 

0018-9200/$20.00 © 2005 IEEE 

Fig. 1. Block diagram of the transceiver. 


II. CHIP ARCHITECTURE 


The 10-Gb XAUI specification, from the 10-G Ethernet stan- 
dard 802.3ae, defines the chip-to-chip interconnect protocol as a 
12.5-Gb/s full duplex raw data rate with 3.125 Gb/s per channel 
on four channels [1]. The implementation described in this paper 
targets the XAUI specification. 

The transceiver uses a four-phase clock with half-rate 
frequency. This clocking scheme enables two levels of input 
muxing on the transmit (TX) side, and 2X oversampling on the 
receive (RX) side. The binary CDR uses the 2X oversampled 
data to recover the clock using a phase-tracking method. A 
mixed-mode adaptive equalizer is used to reduce ISI. Fig. 1 
shows a simplified block diagram of the transceiver. 

In the transmit path, the input FIFO performs rate matching between 
the 10-Gb Media Independent Interface (XGMII) and XAUI. 
An 8 b/10 b encoder converts the input octet data with a control 
bit to a 10-bit coded word. This encoder limits the maximum 
run length to no more than 5 and, as a result, every symbol has timing 
information. Furthermore, it guarantees dc balance because the 
coded word has balanced 1s and 0s. 

The serial TX driver then serializes these coded words, or a 
test pattern generated from a test pattern generator. It uses input 
multiplexing (muxing) rather than conventional output muxing 
because an input-multiplexed transmitter has the advantages of 
small chip area, low power, and low jitter [18]. Two kinds of 
test patterns are used. One is a bit pattern that includes high-frequency, 
low-frequency, and mixed-frequency patterns. The other 
is a packet pattern specified in 802.3ae that consists of continuous 
jitter and continuous random jitter. 


In the receive path, a mixed-mode adaptive equalizer reduces 
the ISI induced by the channel to slim the pulses and open the 
“eye.” The two-tap adaptive equalizer consists of a 1-bit 
delay cell, a preamp, a TX modeler, and tap adaptation circuitry. The 
1-bit delay cell delays the analog input by one unit interval (UI). 
The delay amount is controlled by a PLL 
locked to an external reference clock. Tap adaptation uses the 
sign-sign least-mean-square (LMS) algorithm because of its 
simplicity of implementation, and it is implemented in the digital 
domain. 
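
The sign-sign LMS update can be sketched for a two-tap equalizer. This is a minimal behavioral model under stated assumptions, not the mixed-mode circuit of this paper: the channel is a hypothetical one with a single post-cursor of relative amplitude 0.4, the main tap is fixed at 1, there is no noise, and the step size and symbol count are arbitrary.

```python
import random


def sign(x):
    """Return +1 for non-negative values and -1 otherwise."""
    return 1 if x >= 0 else -1


def adapt_tap(h1=0.4, mu=0.005, n_symbols=5000, seed=1):
    """Adapt the second tap of a two-tap equalizer with sign-sign LMS."""
    rng = random.Random(seed)
    data = [rng.choice((-1, 1)) for _ in range(n_symbols)]
    # Received signal: main cursor plus one post-cursor of ISI (hypothetical channel).
    received = [data[0]] + [data[n] + h1 * data[n - 1] for n in range(1, n_symbols)]
    c1 = 0.0
    for n in range(1, n_symbols):
        y = received[n] + c1 * received[n - 1]   # equalizer output (main tap = 1)
        error = y - sign(y)                      # error against the sliced decision
        # Only the signs of the error and of the delayed input are used.
        c1 -= mu * sign(error) * sign(received[n - 1])
    return c1


h1 = 0.4
c1 = adapt_tap(h1=h1)
print(c1, abs(h1 + c1))
```

Because only signs are used, the update needs no multipliers, which is what makes the algorithm attractive for a digital implementation. In this noiseless model the tap settles in a band of negative values in which the residual first post-cursor, |h1 + c1|, is smaller than the uncorrected value h1.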


The sampler sequentially latches the output of the adaptive 
equalizer using the four-phase PLL clocks and generates 2X 
oversampled data. The CDR circuit extracts the timing informa- 
tion from the 2X oversampled data and feeds the correct sam- 
pling timing to the RX sampler using the phase interpolator. The 
phase interpolator mixes two clock signals selected by the CDR 
circuit and generates an interpolated clock. 

Finally, the synchronizer finds the word boundary in the bit 
stream using a comma detector, and an 8 b/10 b decoder recovers 
the transmitted octet from the coded words. The output FIFO provides 
rate matching and channel alignment among the 
four channels using the ordered sets for channel alignment ||A|| as 
specified in 802.3ae. 

Fig. 2. Block diagram of PLL and 1-bit analog delay cell. 


III. CIRCUIT IMPLEMENTATION 


A. Clocking and Signaling 


This transceiver contains an on-chip clock generation PLL 
to provide global four-phase half-rate clocks. Since the jitter 
performance of the PLL ultimately determines the transceiver 
performance, the clock generation PLL is one of the most 
important parts of the transceiver. In order to achieve low-jitter 
operation, a PLL design requires buffer stage designs with 
low supply and substrate noise sensitivity. For robustness, this 
transceiver employs a self-biased PLL that provides a very 
broad frequency range, minimized supply and substrate noise 
induced jitter, and a high input tracking bandwidth [12]. The 
intrinsic immunity to process technology and environmental 
variability of self-biasing also gives more stability to the PLL. 
Fig. 2 depicts the block diagram of this PLL. 

Deterministic jitter usually comes from the phase mismatch 
of the PLL. To meet the jitter requirements at the near end, 
phase mismatch should be less than 15.3° (0.085 UI). Careful 
layout was used to avoid mismatches among delay cells and 
clock signal paths. In order to reduce the noise coupling from 
the substrate, a fully differential design and decoupling capacitors 
were used. To isolate the PLL from the noisy transmitter 
and digital circuitry, guard rings and separate power pins were 
also used. 

The differential buffer delay stage used in the PLL requires 
an inverter chain to supply clocks at rail-to-rail level. This high- 
frequency level shifter has a bandpass type transfer function and 
reduces low-frequency noise caused by the source follower and 
other circuits. Since the inverter with input and output shorted 
has geometry scaled proportional to the inverters in the inverter 
chain, it gives an optimal input dc bias level. 

The 1-bit delay cell shown in Fig. 2 takes its control voltage from
the PLL and yields an exact 1-bit time T whenever the PLL is
locked. These cascaded delay stages can be used as an
analog delay cell, which is one of the key components of a
continuous-time analog equalizer and is described later in the
adaptive equalizer section.

Fig. 3. Block diagram of input-multiplexed transmitter using shifters.

Fig. 5. Block diagram of adaptive equalizer.

In general, since input-multiplexed transmitters require
smaller layout area and have smaller parasitic components
at the output node, they achieve better performance than
output-multiplexed transmitters [17]. This transceiver also
adopts an input-multiplexed transmitter using shifters. Fig. 3
shows the transmitter, which comprises two 5-bit shifters, a
multiplexer, and an output driver. The shifters load 10-bit data
at every fifth rising edge of the 1.56-GHz clock. One shifter
transfers data to the multiplexer at every rising edge and the
other at every falling edge of the 1.56-GHz clock. The
multiplexer serializes the two shifter outputs, and the output
driver transmits 3.125-Gb/s data through the channel.
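
The data flow through the two shifters and the multiplexer can be sketched behaviorally as follows. This is only an illustration under stated assumptions: the text does not say which bits are loaded into which shifter, so the even/odd split and the bit order below are hypothetical, as are the names `serialize_word`, `CLOCK_HZ`, and `BIT_TIME_S`.

```python
def serialize_word(word_bits):
    """Serialize one 10-bit word through two 5-bit shifters and a 2:1 mux."""
    if len(word_bits) != 10:
        raise ValueError("expected a 10-bit word")
    # Assumed split (not stated in the text): even-indexed bits go to the
    # shifter read on rising edges, odd-indexed bits to the one read on
    # falling edges.
    rising_shifter = list(word_bits[0::2])
    falling_shifter = list(word_bits[1::2])
    serial = []
    for _ in range(5):                         # five clock cycles per word
        serial.append(rising_shifter.pop(0))   # rising edge drives the mux
        serial.append(falling_shifter.pop(0))  # falling edge drives the mux
    return serial


CLOCK_HZ = 3.125e9 / 2            # half-rate clock, about 1.56 GHz
BIT_TIME_S = 1 / (2 * CLOCK_HZ)   # two bits per clock cycle, i.e. 320 ps
```

With two bits leaving the multiplexer per clock cycle, five cycles of the half-rate clock carry one 10-bit word, which is consistent with the word being loaded at every fifth rising edge.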


B. Adaptive Equalizer 


From a time-domain viewpoint, channel attenuation forces the
transferred symbols to spread in time and to interfere with each
other (ISI). The equalizer on the receiver side sharpens the transition
edges of the signal. Sharper transition edges result in wider data
eye openings and a larger timing margin for signal detection. This
effect appears mainly at the highest-frequency bit sequence,
a repeated “01” pattern, as shown in Fig. 4(b). The figure
illustrates the operation of analog filtering in the time domain.

Fig. 4. (a) A simplified block diagram of the continuous-time forward equalizer and (b) operation of the equalizer in time domain.

Fig. 5 illustrates a mixed-mode adaptive equalizer with a 
two-tap LMS adaptation loop. The analog filtering part realizes 
an analog transversal equalizer (ATE) and performs high-speed 
filtering, while the digital tap adaptation part updates the 
coefficients based on the decision result. The analog filtering 
equation is 


$$y(t) = c_0(t)\,x(t) + c_1(t)\,x(t - T) \tag{1}$$

where $T$ is the symbol period.


The analog circuit comprises a variable gain amplifier, an
analog delay line, a transmitter modeler, and an error
comparator. The variable gain amplifier performs the main filtering
functions. It uses two differential pairs whose outputs are
connected to the same PMOS loads. Tail current sources act as gain
modifiers for each tap, and their control voltages are the tap
coefficients. Each differential pair multiplies its input by its tap
coefficient, and the resulting currents are summed at the PMOS load
to yield the estimated signal y(t).

The transmitter modeler generates the reference signal d(t) 
according to the digital value extracted from the estimated 
signal y(t). The error comparator compares the generated 
ideal transmit waveform of the transmitter modeler with the 
estimated signal. The compared result is sampled and fed to the 
digital tap adaptation circuit to update the tap coefficients. 

The coefficients $c_k(t)$ in the equalizer can be adapted using
the sign-sign LMS algorithm [7]. Charge pumps are used to update
the analog coefficients from the output of the digital tap
adaptation circuitry. The update equations for the equalizer
coefficients are

$$c_k(n+1) = c_k(n) + \mu \cdot \mathrm{sign}[e(n)] \cdot \mathrm{sign}[x(n - kT)] \tag{2}$$

where

$$\mathrm{sign}[e(n)] = \mathrm{sign}\left[\left(y(t) - d(t)\right)\big|_{t=nT}\right] \tag{3}$$

and $\mu$ is the scaling factor.

An important advantage of using the sign-sign LMS algo- 
rithm is the simplicity of implementation for the multiplica- 
tion operation in (2). Some potential problems with analog fil- 
ters such as offset and gain errors are mitigated by the LMS 
algorithm [18]. 
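
A discrete-time sketch of a two-tap sign-sign LMS loop may help make the update concrete. It is not the circuit described here: the channel model (one post-cursor ISI tap of 0.4), the step size, and all function names are assumptions chosen for illustration. The sketch also uses the common textbook error convention e = d - y with a positive step; the polarity in the actual loop depends on how the error comparator and charge pumps are wired.

```python
import random


def sign(v):
    """Sign function that treats zero as positive."""
    return 1.0 if v >= 0 else -1.0


def make_channel_output(symbols, isi=0.4):
    """Received samples x[n] = s[n] + isi * s[n-1] (assumed channel model)."""
    out, prev = [], 0.0
    for s in symbols:
        out.append(s + isi * prev)
        prev = s
    return out


def sign_sign_lms(x, mu=1e-3, c0=1.0, c1=0.0):
    """Adapt two tap coefficients using only the signs of error and input."""
    x_prev = 0.0
    for x_now in x:
        y = c0 * x_now + c1 * x_prev       # two-tap filter output
        d = sign(y)                        # decision-directed reference
        e = d - y                          # error, textbook sign convention
        c0 += mu * sign(e) * sign(x_now)   # main-tap update
        c1 += mu * sign(e) * sign(x_prev)  # delayed-tap update
        x_prev = x_now
    return c0, c1


def mean_squared_error(x, symbols, c0, c1):
    """Mean squared error between the filter output and the sent symbols."""
    x_prev, total = 0.0, 0.0
    for x_now, s in zip(x, symbols):
        total += (s - (c0 * x_now + c1 * x_prev)) ** 2
        x_prev = x_now
    return total / len(symbols)


def run_demo(seed=1, n=5000):
    """Return (MSE before adaptation, MSE after adaptation, c0, c1)."""
    rng = random.Random(seed)
    symbols = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    x = make_channel_output(symbols)
    c0, c1 = sign_sign_lms(x)
    return (mean_squared_error(x, symbols, 1.0, 0.0),
            mean_squared_error(x, symbols, c0, c1), c0, c1)
```

Because each update needs only two sign bits and a fixed step, the multiplication in the update reduces to an XOR-like operation followed by a charge-pump step, which is the simplicity the text refers to.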

This adaptive equalizer employs cascaded differential buffer
delay stages to realize a 1-bit delay T. When the PLL locks to an
external clock reference, the resulting VCO control voltage makes
the delay of the four delay cells equal to a 180° phase shift, as
shown in Fig. 2, because an N-stage oscillator generates one cycle
of oscillation after the signal propagates through each stage twice.
The VCO control voltage also feeds the analog 1-bit delay cell, and
the cascaded delay stages then yield a delay of half an oscillation
cycle, that is, a precise 1-bit delay time because half-rate
clocking is used. Therefore, the analog delay line automatically
generates a 1-bit delay time whenever the PLL is locked.
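
The timing relation in this paragraph can be checked numerically. The sketch below assumes a four-stage VCO (suggested by the four delay cells mentioned in the text, but not stated explicitly) and a half-rate clock of exactly 3.125 GHz / 2; the function and variable names are illustrative only.

```python
def stage_delay(f_osc_hz, n_stages):
    """Per-stage delay of an N-stage ring oscillator, from T_osc = 2 * N * t_d."""
    return 1.0 / (2 * n_stages * f_osc_hz)


BIT_RATE = 3.125e9
f_clk = BIT_RATE / 2               # half-rate clock, 1.5625 GHz
n_stages = 4                       # assumed number of VCO delay stages
t_d = stage_delay(f_clk, n_stages) # delay of one cell at lock
line_delay = n_stages * t_d        # replica delay line biased by the same voltage
bit_time = 1.0 / BIT_RATE          # one unit interval
```

Under these assumptions each cell contributes 80 ps, and four cells give 320 ps, which equals one bit time at 3.125 Gb/s.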

A $2^{10} - 1$ pseudo-random bit sequence (PRBS) at 3.125 Gb/s,
taken at the end of a 50-cm PCB trace, was supplied to the
equalizer with a proper setting. Fig. 6 illustrates the
result of the HSPICE simulation. As may be seen, the eye is
completely open, with sufficient margin for the demultiplexing
sampler.


C. Delay-Immune Clock Data Recovery (CDR) 


In addition to adaptive equalization, the CDR circuit, which
retrieves the clock from the nonreturn-to-zero (NRZ) data, is
one of the key components of the receiver. It extracts the clock
information from the data transitions and adjusts the phase of
the sampling clocks. Among tracking phase detection techniques,
traditional proportional tracking data PLLs offer good loop
stability and bandwidth, but generally suffer from a systematic
phase offset and long lock time. The binary CDR technique
potentially provides a higher tracking bandwidth and greater
robustness to phase noise than the PLL-based algorithm, but its
jitter performance is limited by the resolution [19]. Specifically,
the jitter of a binary CDR is set by the minimum resolution of
the phase interpolator because of its bang-bang operation [6]. If
there are delays in the recovery loop, however, the jitter will
exceed the minimum resolution. In this paper, we introduce
a novel CDR algorithm that is immune to the effect of delays
in the recovery loop.

Fig. 6. Simulated eye diagram of adaptive equalizer (a) input and (b) output.

Fig. 7 shows examples of various CDR algorithms in operation.
It is assumed that the incoming data stream has some
frequency offset from the reference clock, as allowed by the IEEE
802.3ae standard. When the CDR circuit has no loop delay, as in
Fig. 7(a), the timing is updated immediately and the recovered
clock jitter is limited to one minimum resolution step of the phase
interpolator. When there are delays in the recovery loop, however,
the circuit exhibits more jitter because of the delayed
timing update, as shown in Fig. 7(b).

Fig. 7.

To eliminate the effect of delays in the recovery loop, this
transceiver adopts a delay-immune CDR algorithm. The motivation
is that the CDR circuit should ignore the excess UP/DN indications
caused by the delays in the recovery loop. To ignore these false
indications, the CDR circuit compares the current UP/DN value with
as many previous UP/DN values as there are delays in the
recovery loop. If the current UP/DN value is the same as the
previous UP/DN values, the CDR circuit does not change the current
timing, since the current UP/DN value was generated before
the timing updates of the previous UP/DN values took effect, owing
to the delays in the recovery loop. The accumulator in the recovery
loop performs this operation. As a result, the CDR circuit achieves
ideal bang-bang operation and the recovered clock jitter is limited
to one minimum resolution step of the phase interpolator, as shown in
Fig. 7(c).
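
A small behavioral model can illustrate why ignoring stale indications restores one-step bang-bang operation. The sketch below is one possible reading of the algorithm, not the implemented accumulator: a decision is discarded when it repeats an update applied within the last `loop_delay` cycles. Phase is counted in interpolator steps, the data edge is placed at a fixed target of 0.5 steps, and no frequency offset is modeled; all names are hypothetical.

```python
def simulate_cdr(loop_delay, steps=200, target=0.5, delay_immune=False):
    """Return the phase trajectory (in interpolator steps) of a bang-bang loop."""
    phase = [0]     # phase[n], in units of one interpolator step
    applied = []    # update actually applied in each cycle (+1, -1, or 0)
    for n in range(steps):
        seen = phase[max(n - loop_delay, 0)]    # detector sees a stale phase
        decision = 1 if seen < target else -1   # UP (+1) or DN (-1)
        action = decision
        if delay_immune and loop_delay > 0:
            # Assumed rule: a decision equal to an update still in flight
            # was generated before that update took effect, so ignore it.
            if decision in applied[-loop_delay:]:
                action = 0
        applied.append(action)
        phase.append(phase[-1] + action)
    return phase


def peak_to_peak(phase, settle=50):
    """Peak-to-peak phase wander after an initial settling period."""
    tail = phase[settle:]
    return max(tail) - min(tail)
```

In this model, with a loop delay of two cycles the conventional loop wanders over five interpolator steps, whereas the delay-immune rule keeps the wander to a single step, at the cost of updating less often.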

Fig. 8 shows an implementation block diagram of the delay-immune
CDR circuit. Clock recovery employs a dual-loop
phase-selection and phase-interpolation scheme [19]. A
multiphase PLL supplies evenly spaced phases, and the clock
recovery performed by the phase-selection and phase-interpolation
loop is completely independent of the PLL. A multiplexer
selects a pair of adjacent clock phases to define a phase
interval for interpolation. The phase interpolator then supplies
the sampling clock to the input samplers. The input samplers
sample the output of the adaptive equalizer with 2× oversampling
to yield center samples and transition samples. These samples
are aligned to give Din[9:0] and Dt[9:0], respectively. The
transition detector generates the Up[9:0] and Down[9:0] vectors from
the input Din[9:0] and Dt[9:0] vectors. The 8b/10b encoder
ensures that there is at least one transition in every coded word,
and the comparator counts the number of 1s in each vector and
compares the two counts. As an output, it generates an Inc/Dec
signal according to the comparison result. A control block is used
to prevent phase discontinuity at quadrant crossings [19]. The
accumulator detects false indications caused by the delays in the
loop, and the final phase selection state is latched and fed to the
multiplexer and the phase interpolator.

Fig. 7. Examples of various CDR algorithms operating, showing maximum jitter.

Fig. 8. Clock and data recovery architecture.

A straightforward comparator implementation comprises a binary
10-bit adder, which encodes each bit vector into a binary
number, and a 4-bit binary comparator. However, it is difficult to
meet the timing requirements with this straightforward digital
implementation. Fig. 9 shows a novel approach that performs the
same function. The basic idea of this algorithm is to pass through
a trellis according to each bit value of Up[9:0] and Dn[9:0]. Instead
of counting the number of 1s in each vector, the state is changed
for each bit value. If Up[n] and Dn[n] have the same value, the next
state keeps the same position. If they differ, however, the state
moves in the direction of the vector holding the 1. Because all 10
bits are applied at the same time, the total time delay is 10 times
that of one state transition. Fig. 9 shows an implementation example
of the state comparator. It consists of only three two-input AND
gates and one three-input OR gate, and by Boolean manipulation
it can be converted to NAND-NAND or NOR-NOR logic. This very
simple architecture allows high-speed operation.
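
The equivalence between the trellis walk and a direct count comparison can be sketched in a few lines. This models only the function, not the gate-level state comparator of Fig. 9; the mapping of a surplus of Up bits to "inc" and the function names are assumptions.

```python
def trellis_compare(up, dn):
    """Walk the trellis: the state moves toward whichever vector holds the 1."""
    state = 0
    for u, d in zip(up, dn):
        if u != d:                     # equal bits leave the state unchanged
            state += 1 if u else -1
    return "inc" if state > 0 else "dec" if state < 0 else "hold"


def count_compare(up, dn):
    """Reference behavior: count the 1s in each vector and compare."""
    diff = sum(up) - sum(dn)
    return "inc" if diff > 0 else "dec" if diff < 0 else "hold"
```

Both functions give the same result for any pair of vectors, because each differing bit position changes the running state by exactly the amount it would change the difference of the two counts.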


IV. MEASUREMENTS


This transceiver chip was fabricated in a standard 0.18-μm
CMOS technology with a 1.8-V supply voltage and packaged in a
256-pin PBGA. It provides a 12.5-Gb/s full-duplex raw data rate
for a single 10-Gb XAUI. The power consumption is 178 mW
per channel and 718 mW in total at the 3.125-Gb/s full-duplex
(Tx/Rx simultaneous) data rate.

Fig. 10 shows the performance of the PLL. It locks to a
156.25-MHz crystal oscillator reference and gives 5.036-ps
(rms) jitter and 40-ps (p-p) jitter. The PLL rms jitter decreases
by about 1.8% per 1% increase in the supply voltage $V_{dd}$.
The input-multiplexing transmitter performance is shown in
Fig. 11. Fig. 11(a) is a result from a digital sampling oscilloscope
showing 5.045-ps (rms) jitter and 46-ps (p-p) jitter. Fig. 11(b)
is a result from LeCroy SDA6000 equipment showing 78.7-ps
total jitter, 5.46-ps random jitter, and 756-fs deterministic jitter.
The transmitter output has a differentially adjustable amplitude
from 800 mV up to a maximum of 1600 mV.

Fig. 11. The TX eye diagram and performance (a) in terms of RMS jitter and
(b) in terms of total jitter.

TABLE I
PERFORMANCE SUMMARY





Fig. 12. Performance comparison of CDR algorithm. (a) Conventional
bang-bang algorithm and (b) delay-immune algorithm.

Fig. 13. Transceiver die photograph.

Fig. 14. Performance comparison table for previous serial-link transceivers.

Fig. 12 shows the performance of the delay-immune CDR circuit.
Fig. 12(a) shows the performance of the bang-bang controlled CDR
with delays, which has a jitter of 37.69 ps (rms) and 196 ps (p-p),
while Fig. 12(b) shows the delay-immune CDR jitter performance of
11.36 ps (rms) and 64 ps (p-p).

The BER measurements are performed using a built-in pattern
generator and BER tester. With various bit patterns and
packet patterns specified in 802.3ae, the transceiver shows a
BER lower than $4.5 \times 10^{-15}$.

All the building blocks of the multiphase PLL, including the loop
filter, are fully integrated on the chip. The chip occupies
2.3 mm × 2.3 mm of die area. The transceiver die photo is shown
in Fig. 13. Table I summarizes the transceiver chip performance,
and Fig. 14 shows a comparison matrix with previous work.


V. CONCLUSION


A four-channel 3.125-Gb/s/ch CMOS serial-link transceiver
is fabricated in a 0.18-μm CMOS process. An input-multiplexing
transmitter with a low-jitter PLL shows only 46-ps
peak-to-peak jitter. For the receiver, a mixed-mode LMS adaptive
equalizer is implemented to reduce ISI and to improve BER
performance. A delay-immune CDR algorithm is proposed and
implemented for clock recovery loop stability. The recovered clock
jitter is measured to be 64 ps (p-p). Because of these techniques,
the measured BER of the overall transceiver is
lower than $4.5 \times 10^{-15}$.


Low-Voltage Low-Power LVDS Drivers


Mingdeng Chen, Member, IEEE, Jose Silva-Martinez, Senior Member, IEEE, Michael Nix, and 
Moises E. Robinson, Member, IEEE 


Abstract—Two low-voltage low-power LVDS drivers used for 
high-speed point-to-point links are discussed. While the previously 
reported LVDS drivers cannot operate with low-voltage supplies, 
the proposed double current sources (DCS) LVDS driver and the 
switchable current sources (SCS) LVDS driver are suitable for 
low-voltage applications. Although static current consumption is 
greater than the minimum amount required by the signal swing, 
the DCS LVDS driver is simple and fast. The SCS LVDS driver, by 
dynamically switching the current sources, draws minimum static 
current and reduces the power consumption by 60% compared to 
previously reported realizations. Both drivers were fabricated in a 
standard 0.35-μm CMOS process; they are compliant with LVDS
standards and can operate at data rates up to gigabits-per-second. 


Index Terms—Back-plane drivers, fast data communication cir- 
cuits, input/output (I/O) drivers, low-voltage differential signaling 
(LVDS), low-voltage low-power integrated circuits. 


I. INTRODUCTION


THE ever-increasing processing speed of microprocessor
motherboards, optical transmission links, chip-to-chip
communications, etc., is pushing the off-chip data rate into 
the gigabits-per-second range. While scaled CMOS technolo- 
gies continue to enhance on-chip operating speeds, off-chip 
data rates have gained little benefit from the increased silicon 
integration. This is primarily due to the excessive power con- 
sumption necessary for driving impedance-controlled electrical 
interconnects, which leads to an increase in costs related to 
packaging and thermal management [1]. In the past, off-chip 
high data rates were achieved by massive parallelism, with 
the disadvantages of increased complexity and cost for the 
IC package and the printed circuit board (PCB). Therefore, 
it is beneficial to move the off-chip data rate to the range of 
Gb/s-per-pin or above. Reducing the power consumption is also 
critical for battery-powered portable systems as well as some 
other systems in order to extend the battery life and reduce the 
costs related to packaging and additional cooling systems. 

Scalable Coherent Interface (SCI) is a high-speed packet 
transmission protocol that efficiently provides the functionality 
of bus-like transactions (read, write, lock, etc.), but it uses a col- 
lection of fast point-to-point links instead of physical buses to 
reach higher speeds. The initial physical implementations were 
based on emitter coupled logic (ECL) signal levels [2], which 
consume more power than is practical in a low-cost workstation 
environment. Low-voltage differential signaling (LVDS) is a
technology developed to provide a low-power and low-voltage
alternative [3] to ECL and other high-speed I/O interfaces for
point-to-point transmissions. LVDS achieves higher speed and
significant power savings by means of a differential scheme for
transmission and termination, in conjunction with low voltage
swing.

In this paper, two low-voltage, low-power, and high-speed 
LVDS drivers are discussed. Both drivers can operate with data 
rates of 1 Gb/s and above, and they are fully compatible with 
IEEE Std 1596.3-1996 [3] for general-purpose links and IEEE 
Draft P802.3ae/D5.0 [4] for XSBI interfaces. Section II dis- 
cusses the LVDS interfaces, the typical LVDS drivers, and the 
design challenges for low-voltage operation. In Section III, the 
low-voltage, low-power LVDS drivers are discussed and some 
of the simulation results are also presented. The experimental 
results and conclusions are addressed in the last two sections. 


II. TYPICAL LVDS DRIVERS


An LVDS interface, as shown in Fig. 1, has a low-voltage 
swing (250-400 mV); it is connected point-to-point and 
achieves very high data rates (up to 500 Mb/s per signal pair) 
and reduced power dissipation [3]. LVDS uses differential data 
transmission and the transmitter is configured as a switched-po- 
larity current generator. A differential load resistor at the 
receiver end provides optimum line impedance matching. 

Due to the imperfect termination, package parasitics, compo- 
nent tolerances or crosstalk [5], there are reflected waveforms 
returning to the driver. As data rates push significantly above 
500 Mb/s and connectors are added, an additional termination 
resistor is usually placed at the source end to suppress reflected 
waves, and the LVDS signaling can be substantially enhanced. 
Low voltage differential signaling is a standardized data trans- 
mission format that is widely used for serial data transmissions; 
as shown in Fig. 2, a differential signal is centered at a common- 
mode voltage of about 1.25 V. The maximum magnitude of the 
differential signal is 400 mV. Typically, the LVDS signal varies 
in magnitude from 1.05 to 1.45 V. 

A typical bridged-switches LVDS driver behaves as a current
source with switched polarity, as shown in Fig. 3(a) [3].
The bias current $I_b$ is switched through the termination
resistors according to the data input, and thus produces the correct
differential output signal swing. A possible implementation of
the typical LVDS driver is shown in Fig. 3(b). It uses four MOS
switches (M1-M4) in a bridged configuration. If switches M1
and M4 are on (D = LOW), the polarity of the output current
is positive, together with the differential output voltage.
Conversely, if switches M1 and M4 are off (switches M2 and M3
are on), the polarity of the output current and voltage is reversed.

The typical LVDS driver works well if the supply voltage
($V_{DD}$) is 2.5 V or greater. It is simple and needs only the
minimum static current consumption to produce the required output
signal swing. But when the supply voltage drops below 2 V (e.g.,
1.8 V for 0.18-μm CMOS technology), the typical LVDS driver does
not have enough headroom in the $V_{DD}$ direction. This is mainly
due to the finite on-resistance of the PMOS transistor switches
and the large amount of current (nominally 6.4 mA for a signal
swing of 320 mV and a 50-Ω termination resistance) flowing
through the switches. The voltage drop across the transistors
consumes headroom, so the LVDS driver demands relatively high
supply voltages to operate properly.
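
The numbers in this paragraph can be reproduced with a short calculation. The sketch assumes the 1.25-V common-mode level and 320-mV swing quoted elsewhere in the paper; "headroom" here simply means the voltage left between the supply and the higher output level, which must be shared by whatever devices sit between them. Function names are illustrative.

```python
def required_current(v_swing, r_term):
    """Current needed to develop v_swing across the termination resistance."""
    return v_swing / r_term


def top_side_headroom(v_dd, v_cm, v_swing):
    """Voltage left between the supply and the higher output level."""
    return v_dd - (v_cm + v_swing / 2)


i_nominal = required_current(0.320, 50.0)          # 6.4 mA
headroom_2v5 = top_side_headroom(2.5, 1.25, 0.320)  # with a 2.5-V supply
headroom_1v8 = top_side_headroom(1.8, 1.25, 0.320)  # with a 1.8-V supply
```

Under these assumptions the available voltage falls from about 1.09 V at a 2.5-V supply to about 0.39 V at 1.8 V, which shows how little margin remains for the drop across the PMOS switches.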
Fig. 4. DCS LVDS driver. (a) Model and (b) potential transistor level
realization.

Fig. 5. SCS LVDS driver model.


III. LOW-VOLTAGE, LOW-POWER LVDS DRIVERS
A. Double Current Sources (DCS) LVDS Driver 


A solution to the headroom issue discussed in Section II is
to remove the top PMOS switches in the typical LVDS driver
[Fig. 3(b)] and replace them with two PMOS current sources,
as shown in Fig. 4(a). We call this structure a double current
sources (DCS) LVDS driver. In order to produce the same signal
swing, the bottom NMOS current source is required to sink $2I_b$,
which is double the static current required by the
output signal swing. Accordingly, the embodiment of Fig. 4(b)
consumes more current than the embodiment of Fig. 3(b). In
addition, the NMOS transistor switches and the bottom NMOS
current source are required to be larger than the corresponding
transistors in Fig. 3(b). If an integrated circuit includes a
plurality of LVDS drivers, the increased current consumption and
transistor dimensions may limit their applications. Also, larger
transistor dimensions increase the total pad capacitance and so
reduce the pin bandwidth.


B. Switchable Current Sources (SCS) LVDS Driver 


Another solution to the headroom issue is shown in Fig. 5.
Instead of using two constant current sources at the top, two
switchable current sources are used [6]. Depending on the
data input, one of the two switchable current sources will
conduct current. This current flows through the termination
resistors and produces the output voltage swing. Notice that the
bottom NMOS current source only needs to sink $I_b$, leading to
minimum static current consumption.

Fig. 6. SCS LVDS driver with control circuit.

Fig. 6 shows the basic principle behind the proposed SCS
LVDS driver. When $V_{ON}$, a reference voltage, is applied to the
gate of M1 (M2), the transistor conducts a current $I_D$, which
is a copy of a well-controlled reference current, regardless of
process, voltage, and temperature (PVT) variations. Here,
transistors M1 and M2 and switches S1, S2, S3, and S4 act as
switchable current sources. For instance, when D is LOW (M1
is ON), M1 conducts current $I_D$, which flows through
the load resistors and M4 to produce the proper output voltage
swing.

There are two design issues that need to be addressed for
the SCS LVDS driver to operate properly. First, we must
determine how to generate the reference voltage $V_{ON}$ such that
$I_D$ remains at the proper value regardless of PVT variations.
Second, since the PMOS switchable current sources need
to conduct large currents, their transistor dimensions are large,
and so are their parasitic capacitances. The question is therefore
how to switch the gate voltages of M1 and M2, that is, how to quickly
charge and discharge the parasitic capacitors at the gates of M1
and M2. Both design issues are addressed in
the SCS LVDS driver shown in Fig. 7; its operation is explained
as follows.

The SCS LVDS driver contains two parts: the switchable current
source control module and the core of the LVDS driver.
The left part of Fig. 7 is the control module, which is used
to generate $V_{ON}$ such that, when it is applied to the gate of
M1 (M2), the drain current $I_D$ is proportional to $I_{ref}$. The
cascode transistor M7 and amplifier Amp form a regulated-gain control
(RGC) loop. This RGC loop is used to set M6's drain voltage to
$V_{D\_ref}$ (= 1.41 V). It is important to make sure that the output
common-mode voltage and signal swing are maintained; hence
the higher output voltage $V_{op}$ ($V_{on}$) is fixed, and it is
defined by $V_{D\_ref}$ (= $V_{ocm\_ref} + V_{o,swing}/2$), regardless
of PVT variations. $V_{ocm\_ref}$ is the output common-mode reference
voltage, and $V_{o,swing}$ is the required signal swing. For instance,
for an output common-mode voltage of 1.25 V and an output
signal swing of 320 mV, the higher LVDS output voltage
$V_{op}$ ($V_{on}$) should ideally be 1.41 V. By setting the drain
voltage of M6 to $V_{D\_ref}$, we obtain good matching for the current
mirror composed of M6 and M1 (M2). Note also that
the switchable current source control module can be shared
by several LVDS drivers, but independent buffers are used for
each driver in order to minimize signal feedthrough.

The right part of Fig. 7 is the core of the SCS LVDS driver.
The switchable current sources are used to generate current $I_D$,
and they are composed of transistors M1 and M2, the
buffer-connected amplifier Buf-A, switches S1 and S2, and the pull
up/down circuits. The pull up/down circuits are used to quickly
change the gate voltages of M1 and M2, i.e., to quickly charge
or discharge the parasitic capacitors associated with the node
$V_{gate}$. The buffer-connected amplifier Buf-A is used to isolate
the DC voltage $V_{ON}$ from the data-controlled switches. It also
provides “fine adjustment” of the gate voltage of M1 (M2) when
the switch S1 (S2) is closed, while the pull up/down circuit,
driven by the input data, provides coarse control. The common-mode
feedback (CMFB) circuit is used to set the output common-mode
voltage to the desired reference voltage $V_{ocm\_ref}$.

The operation of the switchable current sources is explained
as follows. If data D is LOW, then switch S1 is ON and switch
S2 is OFF. M1's gate voltage is pulled down to $V_{ON}$ through
the pull up/down circuit during the data transition, while M2's
gate voltage is pulled up close to $V_{DD}$. M1 conducts current
$I_D$ and M2 is OFF. The current $I_D$ flows through the termination
resistors and produces the signal swing.


C. Pull Up/Down Circuits 


An active pull up/down circuit is shown in Fig. 8 [7]. In this
structure, both the pull-up and pull-down sections produce short
current pulses at the data transition edges. These
current pulses are used to charge/discharge the parasitic
capacitors and so to pull up/down the switchable current source gate
voltages. Some design issues are associated with this active pull
up/down circuit. First, the circuit itself consumes large dynamic
power because of the several delay cells used and the high data rate.
Second, the currents produced by the pull up/down circuit
are finite, and they limit the speed of the charging/discharging
process. Also, since the currents are produced by PMOS and
NMOS transistors, respectively, the charge injected into the
capacitors may not equal the charge extracted from the
capacitors. This difference must be supplied by the “Buffer”
shown in Fig. 7, which requires a fast circuit implementation
that demands more power.

Instead of using an active pull up/down circuit, we propose to
use passive capacitors $C_{pp}$ driven by the input data for the SCS
LVDS driver; the principle of operation is shown in Fig. 9. The
passive pull up/down circuit does not have the drawbacks of
the active pull up/down circuit mentioned above. The
capacitor $C_{pp}$, driven by the input data D, is used to pull the
M1 (M2) gate voltage up/down with drastically reduced transition time
and to provide coarse control over the gate voltage $V_{gate}$. The
parasitic capacitor $C_p$ associated with the node $V_{gate}$ and the
capacitor $C_{pp}$ form a capacitive voltage divider. When D goes
down, $V_{gate}$ equals $V_{ON}$ and $I_D$ is determined by $I_{ref}$,
while $C_{pp}$ is charged to $V_{ON}$. During the low-to-high
transition of D, the switch resistance is high and the charge
injected by $C_{pp}$ is mainly absorbed by $C_p$, turning off the
transistor. The resulting waveforms of the data and the gate voltage
$V_{gate}$ are also shown in Fig. 9. It is easy to show that the
M1 (M2) gate voltage variation $\Delta V_{gate}$ can be expressed as

$$\Delta V_{gate} = \frac{C_{pp}}{C_{pp} + C_p}\, V_{DD} \tag{1}$$

where $\Delta V_{gate}$ is defined as $\Delta V_{gate} = V_{OFF} - V_{ON}$.
It is assumed that data D varies from $V_{DD}$ to zero.

It is worth mentioning that when the transistor M1 (M2) is
turned off, its gate voltage $V_{OFF}$ does not need to be $V_{DD}$;
for fast circuits, it is better for $V_{OFF}$ to be lower than $V_{DD}$
such that the transistor operates in the subthreshold region. In this
way, we can turn the switchable current sources on/off more
quickly and minimize the dynamic power consumption needed
to charge/discharge $C_p$ and $C_{pp}$, as long as the current
$I_{OFF}$ flowing through the OFF switchable current source is
negligible. By choosing a proper limit for $I_{OFF}$, we can find the
gate voltage variation $\Delta V_{gate}$ such that $I_{OFF}$ does not
exceed this limit. Then, the value of the capacitor $C_{pp}$ can be
determined as

$$C_{pp} = \frac{C_p \cdot \Delta V_{gate}}{V_{DD} - \Delta V_{gate}} \tag{2}$$

For this design, $C_p$ is around 6.4 pF and $C_{pp}$ is chosen to
be 0.8 pF, which occupies 1000 μm² with a poly-poly
implementation. The switches are implemented with transmission
gates; transistor dimensions are 60/0.4 and 20/0.4 for PMOS
and NMOS, respectively. The current $I_{OFF}$ flowing through the OFF
switchable current source is around 240 μA and $\Delta V_{gate}$
is around 200 mV. Notice that the data D drives an equivalent
Since M1 (M2) is working in the subthreshold region, its current
is very small; hence the supply variation has very limited effect
on the output signal amplitude.
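
The quoted component values can be cross-checked against the capacitive-divider relation, in which the gate swing is the supply swing of D scaled by the ratio of the pull up/down capacitor to the total capacitance at the gate node. The sketch assumes a 1.8-V data swing, which the text does not state explicitly at this point but which reproduces the quoted 0.8 pF and 200 mV; function names are illustrative.

```python
def gate_swing(c_pp, c_p, v_dd):
    """Gate voltage step produced by the capacitive divider."""
    return c_pp / (c_pp + c_p) * v_dd


def pull_capacitor(c_p, dv_gate, v_dd):
    """Pull up/down capacitor needed for a target gate swing (divider inverted)."""
    return c_p * dv_gate / (v_dd - dv_gate)


V_DD = 1.8                                    # assumed data swing in volts
c_pp = pull_capacitor(6.4e-12, 0.200, V_DD)   # capacitor for a 200-mV swing
dv = gate_swing(0.8e-12, 6.4e-12, V_DD)       # swing produced by 0.8 pF
```

With a 6.4-pF parasitic capacitance and a 200-mV target swing, the required capacitor evaluates to 0.8 pF, and conversely 0.8 pF yields a 200-mV step, in agreement with the values given for this design.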

Compared to the active pull up/down circuit, this passive pull
up/down circuit is faster as a result of the capacitors used,
consumes less power, and produces symmetrical up/down voltage
changes. With symmetrical voltage changes, the switches S1 and
S2 can be small and the speed requirement on Buf-A is relaxed.
Also, the driver's architecture is simpler and, therefore, more robust.


D. Simulation Results 


The transistor dimensions of the DCS and SCS LVDS driver 
cores are shown in Table I. The simulated DCS LVDS driver 
output common-mode and differential-mode voltages with data 
rate of 1.25 Gb/s are shown in Fig. 10. In this simulation, the 
models of the electrostatic discharge (ESD) device, bonding
wire, and package are included. Also, the termination resistor 
and load capacitors at the receiver end are included. Notice that 
both common-mode and differential-mode output voltages are 
within the LVDS standard specifications. 

From the discussion in the preceding sections, it can
be seen that the key design issue of the SCS LVDS driver is to
control the switchable current source gate voltage $V_{gate}$ and thus
the corresponding drain current. Fig. 11 shows the simulation
results for the switchable current source gate voltage $V_{gate}$ (top
trace), the transistor drain current $I_D$ (middle trace), and the
corresponding output differential voltage (bottom trace); the load
model was simplified in order to show the change in $V_{gate}$ more
clearly. Notice that the gate voltage $V_{gate}$ and the corresponding
drain current $I_D$ switch properly. The transition time is only around
240 ps, and it can be seen that the rise time and fall time
of the output signal are within the specifications (300-500 ps).
The short transition time is mainly due to the passive capacitors
used for the pull up/down circuit and to operating the switchable
current sources in the subthreshold region when they are turned
OFF. The gate voltage variation $\Delta V_{gate}$ is around 200 mV, and
the drain currents $I_{ON}$ and $I_{OFF}$ are around 6.4 mA and 240 μA,
respectively. Notice that the gate voltage $V_{gate}$ and the drain
current $I_D$ present small variations. They are due to the transients
of charging/discharging the parasitic capacitances.


IV. EXPERIMENTAL RESULTS


Both the DCS and SCS LVDS drivers have been fabricated in
the TSMC 0.35-µm CMOS process through the MOSIS service;
the active die areas are 0.11 mm² and 0.14 mm², respectively.
The chip, whose micrograph is shown in Fig. 12, was packaged in
a 64-pin ceramic quad flat package. According to the experi-
mental results, the DCS LVDS driver operates properly for a 
data rate up to 1.4 Gb/s and the SCS LVDS driver operates for 
data rates up to 1.2 Gb/s. Those shortcomings might be allevi- 
ated if more advanced processes or N-type switchable current 
sources are used. 

Figs. 13 and 14 show the DCS LVDS driver differential output 
eye diagrams with a 2^31 - 1 pseudorandom bit sequence (PRBS)
pattern and data rates of 680 Mb/s and 1.0 Gb/s, respectively. 
The single-ended output signal swings are around 340 mV and 
Fig. 12. DCS and SCS LVDS drivers chip micrograph.


the measured root-mean-square (RMS) jitters are 15 and 36 ps, 
respectively. The eye openings are 90% and 80%, respectively. 
Figs. 15 and 16 show the SCS LVDS driver differential eye
diagram with a 2^31 - 1 PRBS at data rates of 680 Mb/s and 1.0 Gb/s,
respectively. The differential output signal swings are 680 mV 
and the measured RMS jitters are 28 and 50 ps, respectively. 
The eye openings are 85% and 60%, respectively. 
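
To put these jitter figures in proportion to the bit period, the following minimal sketch converts each data rate to a unit interval (UI) and expresses the measured RMS jitter as a fraction of it. The function and variable names are illustrative assumptions and do not come from the paper; the numbers are the measured values quoted in the text.

```python
def unit_interval_ps(data_rate_mbps: float) -> float:
    """Bit period in picoseconds for a data rate given in Mb/s."""
    if data_rate_mbps <= 0:
        raise ValueError("data rate must be positive")
    return 1.0e6 / data_rate_mbps


def jitter_fraction_of_ui(rms_jitter_ps: float, data_rate_mbps: float) -> float:
    """RMS jitter expressed as a fraction of one unit interval."""
    return rms_jitter_ps / unit_interval_ps(data_rate_mbps)


# Measured values quoted in the text: (driver, data rate in Mb/s, RMS jitter in ps).
MEASUREMENTS = [
    ("DCS", 680.0, 15.0),
    ("DCS", 1000.0, 36.0),
    ("SCS", 680.0, 28.0),
    ("SCS", 1000.0, 50.0),
]

if __name__ == "__main__":
    for driver, rate, jitter in MEASUREMENTS:
        frac = jitter_fraction_of_ui(jitter, rate)
        print(f"{driver} @ {rate:.0f} Mb/s: "
              f"UI = {unit_interval_ps(rate):.0f} ps, "
              f"RMS jitter = {100 * frac:.1f}% of UI")
```

At 1.0 Gb/s the unit interval is 1000 ps, so the 50 ps RMS jitter of the SCS driver corresponds to 5% of the bit period.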

Compared to the DCS LVDS driver, the SCS LVDS driver 
presents larger jitter and narrower open eyes. Several factors 
contribute to this. First, the rising and falling times of the SCS 
LVDS driver output signal are larger than those of the DCS 
LVDS driver output signal, which is due to the finite transition 
times of the gate voltage and drain current of the switchable cur- 
rent sources. Second, while the drain current of the PMOS cur- 
rent sources in the DCS LVDS driver remains constant, the drain 
current of the switchable current sources presents some varia- 
tions, which is due to the transients of charging/discharging the 
parasitic capacitances. Also, the effect of the charge injection 


Fig. 11. Switchable current source gate voltage (top), drain current (middle), and the output differential voltage (bottom).
Fig. 14. DCS LVDS driver eye diagram (data rate = 1.0 Gb/s).


on the driver’s output nodes is more pronounced for the SCS 
LVDS driver than for the DCS LVDS driver. 

The total current consumption (including both static and dy-
namic) of the two LVDS structures for different data rates is
given in Table II. The dynamic power consumed by the parasitic
capacitance of the NMOS switches has been neglected for both
structures. While in this table the current consumption of the
DCS LVDS driver consists only of the static tail current, that of the
Fig. 15. SCS LVDS driver eye diagram (data rate = 680 Mb/s).
Fig. 16. SCS LVDS driver eye diagram (data rate = 1.0 Gb/s).


SCS LVDS driver includes the current drawn by the buffer-con- 
nected amplifier Buf-A, the dynamic current consumed by the 
parasitic capacitance of the switchable current sources, and the 
static tail current. It can be seen that the SCS LVDS driver draws 
much less current than the DCS LVDS driver. 

A comparison among these two structures and a previously 
reported LVDS driver [8] is shown in Table III. This reported 
driver is based on typical LVDS configurations, except that it 
uses all NMOS switches to reduce the charge injection effects. 
Another reported LVDS driver requires an external resistor and
two reference voltages [9]. Notice that both the DCS and SCS
LVDS drivers consume less power than previous realizations.
In particular, by dynamically switching the current sources,
the SCS LVDS driver reduces the power consumption by 60%
compared to the previous implementations (if the same signal
swing is maintained). In addition, while the previously reported
LVDS drivers cannot operate properly with low-voltage supplies,
both the DCS and SCS LVDS drivers are suitable for
low-voltage supply applications, and they are still compliant with
LVDS standards and operate properly at very high data rates.

In addition to the low-power consumption, the other bene- 
fits of the low-voltage supply drivers are reduced EMI and costs 
related to the packaging and cooling systems. Being able to op- 
erate with low-voltage supplies makes it possible to use the same 
supply for both the core circuits and the I/O drivers, which can 
simplify both circuit and PCB design. 


V. CONCLUSION 


Two LVDS driver structures suitable for very low-voltage 
supplies (as low as 1.8 V) are discussed. The DCS LVDS driver 
is simple and fast. Despite the dynamic power consumed by 
TABLE II
CURRENT CONSUMPTION FOR DCS AND SCS LVDS DRIVERS 
Data Rate (Mb/s) 680 1000 
DCS Taverage (MA) 5 128 Pi 128 ‘ 
SCS Taverage (MA) $5.1 92° 499 
TABLE III
COMPARISON WITH PREVIOUS REALIZATIONS
| [8] | DCS | SCS
Technology | 0.35-µm CMOS | 0.35-µm CMOS | 0.35-µm CMOS
Output Voltage Swing (mV) ee 42 : 340 340 
Maximum Data Rate (Mb/s) | 1200 | 1400 | 1200
Static Power Consumption (mW) | 43 | 23 | 12.8
Cell Size (mm²) | 0.47 | 0.11 | 0.14
Supply Voltage (V) | 3.3 | 1.8 | 1.8




















the parasitic capacitance of NMOS switches, the DCS LVDS 
driver power consumption is almost constant, regardless of the 
data patterns. A drawback of the DCS LVDS driver is that its
static current consumption is twice the minimum required by the
output voltage swing. Another drawback is that the transistor
dimensions of the switches and the bottom NMOS current sources
are relatively large because of the larger amount of current used;
therefore, die area and parasitic capacitances increase.

The SCS LVDS driver is more complex than the DCS
LVDS driver, but its most significant advantage is that the static
current consumption is kept to the minimum required by the
output voltage swing and load. Because the parasitic capacitance
associated with the switchable current sources must be charged
and discharged, the SCS LVDS driver power consumption
depends on the data pattern, even if we neglect the dynamic power
consumed by the parasitic capacitance of NMOS switches. The
higher the data rate, the larger the dynamic power consumption
of the pull up/down circuit.
High-Performance Low-Power Dual Transition
Preferentially Sized (DTPS) Logic 


Woopyo Jeong and Kaushik Roy, Fellow, IEEE 


Abstract—We present a dual transition preferentially sized 
(DTPS) logic that uses two separate paths—one for the fast prop- 
agation of low-to-high signal and the other for fast propagation of 
high-to-low signal. DTPS logic is suitable for multistage buffers 
and critical sections of datapaths requiring good noise immunity 
and low power dissipation while achieving high performance. 
We derived formulas to obtain optimal tapering factors of mul- 
tistage buffers based on preferentially sized (PS) inverters, and 
implemented DTPS logic using the optimal tapering factors. We 
fabricated datapaths based on static CMOS logic, domino logic, 
and DTPS logic in 0.18-µm technology. DTPS logic shows 15%
and 16% improvements in performance and power dissipation, 
respectively, over domino, and 42% improvement in performance 
compared to static CMOS. 


Index Terms—Dual transition preferentially sized (DTPS), pref- 
erentially sized logic, tapering factor. 


I. INTRODUCTION 


WITH the scaling of process technology, high perfor-
mance and low power consumption are becoming
important issues in circuit design. The use of domino circuits
is one way to alleviate the problem of high-performance circuit 
design. However, domino circuits consume more power than 
standard CMOS logic, and are susceptible to noise (for scaled 
technologies with low transistor threshold voltage) because 
in the evaluation mode intermediate nodes may be floating 
[1]-[3]. 

In order to achieve good noise immunity and low power con- 
sumption while achieving performance comparable to domino 
logic, we propose dual transition preferentially sized (DTPS) 
logic, which consists of dual monotonic datapaths (one is fast 
for rising transition of input, and the other is for falling tran- 
sition) using preferentially sized (PS) circuits [4]. Since a PS 
inverter chain uses up-sized inverters and down-sized inverters 
alternately to speed up data propagation in evaluation cycle, the 
ratio of output capacitance to input capacitance of even stages of 
multistage PS buffers is different from that of odd stages. Hence, 
different tapering factors should be used for even and odd stages, 
which are also different from the tapering factor of normal in- 
verter chains. We derive formulas for optimal tapering factors of 
multistage buffers based on PS inverters to minimize the propa- 
gation delay. DTPS is implemented based on PS inverter chains 
Fig. 1. N-stage preferentially sized buffers with dual tapering factors
(a) starting with an up-sized inverter and (b) starting with a down-sized inverter.


using dual tapering factors. DTPS logic is not only suitable for 
multistage buffers but also ideal for critical sections of datap- 
aths requiring high performance and low power consumption. 
We also describe how to design DTPS logic using a high sizing 
ratio in critical paths of design to achieve a very high perfor- 
mance. We fabricated datapaths based on static CMOS logic, 
domino logic, and DTPS logic. The measurement results show 
the advantages of DTPS logic. 


II. PREFERENTIALLY SIZED (PS) LOGIC 


In order to design high performance multistage buffers and 
datapaths using DTPS logic proposed in this paper, we first con- 
sider PS buffers that are the building blocks of the DTPS buffers. 
Then, in order to minimize the delay due to the PS buffers, op- 
timal tapering factors are considered, which are different from 
the tapering factor of normal inverter chains [5], [6]. Fig. 1 
shows some examples of how to adjust the sizes of PS circuit 
style, where s is the sizing ratio and α is the ratio of optimal
size of the PMOS to NMOS in a static CMOS inverter. β is the
tapering factor, which is the ratio of output capacitance to the
input capacitance of an inverter. The arrows represent the sizing
directions of the inverters. In this paper we used sizing ratios
greater than 1. Since a multistage PS buffer uses up-sized
inverters and down-sized inverters alternately, we should use two
tapering factors for a multistage PS buffer: one for the even
stages and the other for the odd stages. In Fig. 1, the N-stage
PS buffer uses two tapering factors (β1 and β2). One tapering
factor (β1) is used for the even stages, and the other (β2) is
for the odd stages. Hence, the output capacitive load, C_L, is
C_L = (β1β2)^(N/2) C_IN = (β1β2)^(N/2) (1 + αs) C_0, where
C_0 is the input gate capacitance per unit area.
In Fig. 1(a), the low-to-high propagation delay of the first
stage (t_pLH1) and the high-to-low propagation delay of the
second stage (t_pHL2) are given as follows:
The total propagation delay is minimum when the
propagation delays of each stage are the same (t_pHL1 =
t_pLH2 = t_pHL3 = ⋯) [2]. Hence, we can obtain β1 ≈
[(α + s)/(1 + αs)] · β2 from (1) and (2), and the total
propagation delay (t_pHL1 + t_pLH2 + t_pHL3 + ⋯ + t_pN)
can be written as follows:
In (3), when s = 1, t_p is the total propagation delay of an
N-stage normal multistage buffer. An optimal number of stages
of multistage PS buffers with dual tapering factors (N_opt),
which is obtained by solving ∂t_p/∂N = 0, is ln(C_L/C_IN).
which are different from the optimal tapering factors of normal
multistage buffers.
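
The optimal number of stages stated in the text, ln(C_L/C_IN), is easy to evaluate numerically. The sketch below does so and also rounds the result to a whole number of stages. The function names, the rounding rule, and the example capacitance ratio are assumptions made for illustration; the paper gives only the closed-form expression.

```python
import math


def optimal_stage_count(c_load: float, c_in: float) -> float:
    """Real-valued optimum N_opt = ln(C_L / C_IN) quoted in the text."""
    if c_load <= 0 or c_in <= 0:
        raise ValueError("capacitances must be positive")
    return math.log(c_load / c_in)


def rounded_stage_count(c_load: float, c_in: float) -> int:
    """Nearest whole number of stages, never fewer than one.

    Rounding to the nearest integer is an assumption of this sketch;
    a real design may also need a particular parity to preserve polarity.
    """
    return max(1, round(optimal_stage_count(c_load, c_in)))


if __name__ == "__main__":
    # Hypothetical example: the load is 100 times the input capacitance.
    print(optimal_stage_count(100.0, 1.0))
    print(rounded_stage_count(100.0, 1.0))
```

Only the ratio C_L/C_IN matters, so the two capacitances can be given in any common unit.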

Fig. 2 depicts delays of multistage PS buffers with dual 
tapering factors, which are normalized to the delay of a normal 
multistage buffer (when s = 1). The dotted lines represent 
simulation results for different number of stages, and the solid 
line shows the analytical result obtained from (3). In Fig. 2, 
when the sizing ratio is 5, the delays of multistage PS buffers 
are 56% less than those of normal multistage buffers. However, 
since the propagation delay in the precharge cycle is much 
larger than that of the evaluation cycle, conventional PS logic 
does require a clock signal, though only selective logic gates 
may require the clock to reset the PS logic in the precharge 
cycle [4]. This increases the clock load and, hence, the power 
consumption. 








III. DUAL TRANSITION PREFERENTIALLY SIZED (DTPS) LOGIC 


We propose to use DTPS logic, in which the sizes of PS in- 
verters on each datapath are determined based on the optimal 
tapering factors, to achieve high performance and low power 
dissipation. DTPS logic does not require a clock signal. Fig. 3 
shows an example of DTPS logic that achieves high perfor- 
mance by duplicating signal paths: both paths consist of PS 
logic, in which one signal path is for fast rising transition of 
input, while the other is for fast falling transition. A combiner
detects the earliest transition, latches it, and then transfers the 
data to the next stage. Hence, DTPS logic can achieve fast prop- 
agation delay both in evaluation and precharge modes. For
example, in Fig. 3, if the input toggles from high to low, the top
path is faster than the bottom path. Hence, though both nodes
N2_T and N2_B transit from high to low, N2_T transits faster
than the node N2_B. The high-to-low transition on N2_T turns
on MP3, while N3_T stays at low, which makes output transit
from high to low. If the input toggles from low to high, the node
N2_B transits from low to high faster than the node N2_T. The
low-to-high transition on N2_B turns on MN2, while N3_B stays
at high, which makes output transit from low to high.

The circuit diagram shown in Fig. 3 is valid only when a low
sizing ratio (s) is used, in which the difference between fast
propagation delay and slow propagation delay is less than the
clock period. To achieve high performance, highly preferentially
sized inverters may be required. However, using highly
preferentially sized inverters (for a sizing ratio for which
T_d_slow > T_d_fast + T_cycle) can make the transition of slow data
due to the previous input signal and the transition of fast data due
to the current input signal occur at almost the same time at a
certain node, creating a glitch (spurious transition).

Fig. 4 shows an example of this functional problem of DTPS 
buffers with a high sizing ratio. The previous input, IN[i-1], 
toggles from low to high, and the current input, IN[i], toggles 
from high to low. Propagation due to current input (IN[i]) is 
faster than due to previous input (IN[i-1]) on the top datapath 
in Fig. 3 because transition directions of PS inverters due to 
Fig. 4. Timing diagram of the DTPS buffer having high sizing ratio
(T_d_slow ≥ T_d_fast + T_cycle).





















































Fig. 5. Cross-path DTPS logic applied for critical sections of datapaths.


the current input are the same as their sizing directions.
Propagation due to the previous input (IN[i-1]) is slow on the top path,
and hence, the fast high-to-low transition due to the current input and
the slow low-to-high transition due to the previous input occur
simultaneously at N2_T. This produces a glitch at N2_T and cannot
turn on MP3. Hence, even though the input toggles from
high to low, the output does not toggle and keeps the
previous data (high). This problem occurs when the delay of the
slow path is larger than the sum of the delay of the fast path and
the cycle time (T_d_slow > T_d_fast + T_cycle).

To solve the spurious transition problem of the multistage 
DTPS buffer having highly preferentially sized inverters, we 
propose a multistage cross-path DTPS buffer that uses extra 
logic to take care of the robustness problem by reducing the 
propagation delay in the slow path. The proposed cross-path 
DTPS circuit techniques are applicable to multistage buffers 
and critical sections of datapaths requiring very high perfor- 
mance with low power consumption. Fig. 5 shows the proposed 
cross-path DTPS logic applied to critical sections of the datap- 
aths requiring very high performance. The compound gates (G1 
and G2) are added to handle the robustness problem of DTPS 
logic mentioned above. The architecture of DTPS in a data- 
path is the same as that in the multistage DTPS buffer except 
that a datapath consists of combinational gates like NAND, NOR, 
or other complex gates. We can partition combinational gates 
on each datapath into two parts: gates having a critical input 
signal and gates having noncritical input signals. For example, 
in Fig. 6, a 4-input NAND gate (G1) on the top datapath can be 
partitioned into 2-input NAND gate (G3) and 3-input NOR gate 
Fig. 6. DTPS logic for datapaths (a) before logic restructuring and (b) after
logic restructuring.







































































Fig. 7. Chip microphotograph.


(G5), and G2 on bottom datapath can also be partitioned into 
G4 and G5. Since G5 is common, it can be shared as shown in 
Fig. 6(b). Gates G3 and G4 are on the critical paths; however,
G5 is on the noncritical path. Logic restructuring reduces gate
fan-in and the size of transistors on the critical paths, which de- 
creases load capacitance on the critical paths, thereby reducing 
the delay and the layout area [2]. 

In Fig. 5, when the input transits from low to high, data
propagation through the top path is faster than through the bottom
one. Hence, NT0 goes high faster than NB0 while NT1 stays at
high, which makes NT2 transit from high to low. The low-to-high
transition of NT0 also makes NB2 transit from high to low
independent of the data at node NB0, i.e., the high-to-low
transition of NB2 occurs before the low-to-high transition of NB0. In
this case, NT1 should transit from high to low after NB0
transits from low to high. If NT1 transits before the NB0 transition,
a glitch occurs at NT2. On the other hand, when the input
transits from high to low, NT2 on the top path is determined by
the fast high-to-low transition of NB0, while NT1 is low. The
delay of the slow path of this DTPS buffer, T'_d_slow, is defined
as 0.5 · (T_d_fast + T_d_slow) + T_comp, where T_comp is the delay
due to a compound gate (T'_d_slow < T_d_slow). Hence, inserting
extra component gates for determining the slow path can reduce the
delay of the slow path and remove the glitch problem of DTPS
logic with high sizing ratio mentioned earlier.
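
The glitch condition and the cross-path slow-path delay described above can be checked numerically. The following sketch encodes the two relations from the text; the function names and the example delays (in nanoseconds) are hypothetical values chosen for illustration and are not measurements from the paper.

```python
def has_glitch_risk(t_d_slow: float, t_d_fast: float, t_cycle: float) -> bool:
    """True when the slow-path delay exceeds the fast-path delay plus one cycle,
    which is the condition the text gives for a spurious transition."""
    return t_d_slow > t_d_fast + t_cycle


def cross_path_slow_delay(t_d_slow: float, t_d_fast: float, t_comp: float) -> float:
    """Slow-path delay of the cross-path DTPS buffer as defined in the text:
    0.5 * (T_d_fast + T_d_slow) + T_comp."""
    return 0.5 * (t_d_fast + t_d_slow) + t_comp


if __name__ == "__main__":
    # Hypothetical delays in ns, not taken from the paper.
    t_fast, t_slow, t_cycle, t_comp = 1.0, 4.0, 2.5, 0.3

    print(has_glitch_risk(t_slow, t_fast, t_cycle))        # plain DTPS buffer
    new_slow = cross_path_slow_delay(t_slow, t_fast, t_comp)
    print(new_slow)
    print(has_glitch_risk(new_slow, t_fast, t_cycle))      # cross-path DTPS buffer
```

With these assumed numbers the plain buffer violates the timing condition, while the cross-path slow delay falls below the fast delay plus one cycle. The reduction holds only when the compound-gate delay is smaller than half the gap between the slow and fast delays.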
TABLE I
Fig. 8. Measured output waveforms of (a) bypass and (b) DTPS logic.


IV. EXPERIMENTAL RESULTS 


We fabricated datapaths based on DTPS, domino, and static
CMOS logic using TSMC 0.18-µm CMOS technology, and
compared DTPS logic with domino logic and static CMOS
logic with respect to performance and power consumption.
Fig. 7 shows the chip microphotograph. It consists of a bypass
path, three datapaths based on DTPS logic, domino
logic, and static CMOS, and multiplexers (muxes) to select
one of the datapaths or the bypass path. Fig. 8 shows that the
measured delays of the datapath based on DTPS logic and the
bypass path are 18.0 and 12.5 ns, respectively. Hence, the real
delay of DTPS logic is 5.5 ns. Using the same method, the
delays of datapaths based on domino logic and static CMOS
logic are obtained.
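
The de-embedding step described above is a simple subtraction of the bypass-path delay from the measured datapath delay. The sketch below reproduces it for the DTPS figures quoted in the text and adds a helper for relative improvement; the function names are illustrative assumptions, and no domino or static CMOS delays are assumed here because the text does not state them.

```python
def deembedded_delay_ns(measured_ns: float, bypass_ns: float) -> float:
    """Datapath delay with the shared bypass (pad and mux) delay removed."""
    if measured_ns < bypass_ns:
        raise ValueError("measured delay should not be smaller than bypass delay")
    return measured_ns - bypass_ns


def relative_improvement(new_delay_ns: float, reference_delay_ns: float) -> float:
    """Fractional delay reduction of new_delay_ns relative to reference_delay_ns."""
    if reference_delay_ns <= 0:
        raise ValueError("reference delay must be positive")
    return (reference_delay_ns - new_delay_ns) / reference_delay_ns


if __name__ == "__main__":
    # Values quoted in the text: 18.0 ns through the DTPS datapath, 12.5 ns bypass.
    print(deembedded_delay_ns(18.0, 12.5))
```

The same two functions can be applied to the domino and static CMOS measurements once their raw delays are known.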

Table I summarizes the measured delays and power con- 
sumptions of datapaths of different logic styles. It shows 15% 
and 16% improvements in performance and power, respec- 
tively, over domino logic. DTPS and domino logic show 42% 
and 31% delay improvements over the static CMOS logic. 





11466 µm²





7357 µm² 6200 µm²


V. CONCLUSION 


In this paper we proposed DTPS logic, which is suitable for 
multistage buffers and critical sections of datapaths requiring 
a very high performance with low power consumption. We 
derived expressions for optimal tapering factors of multi- 
stage buffers based on PS inverters to minimize the propagation 
delay. Analytical results show that PS buffers with dual tapering 
factors can achieve up to 13% performance improvement over 
ones using one tapering factor. For the PS buffers using dual ta- 
pering factors, the difference between the analytical results and 
the simulation results is less than 10%. We fabricated a test chip
for datapaths based on DTPS logic, static CMOS, and domino
logic. The measured results show 15% and 16% improvements
in performance and power, respectively, over domino and 42%
delay improvement over the static CMOS logic.
Design Considerations for Soft Embedded
Programmable Logic Cores
Abstract—As integrated circuits become increasingly more
complex and expensive, the ability to make post-fabrication 
changes will become much more attractive. This ability can be 
realized using programmable logic cores. Currently, such cores are 
available from vendors in the form of “hard” rectangular layouts. 
In this paper, we focus on an alternative approach for fine-grain 
programmability: vendors supply a synthesizable RTL version of 
their programmable logic core (a “soft” core) and the integrated 
circuit designer synthesizes the programmable logic fabric using 
standard cells. Although this technique suffers in terms of speed, 
density, and power overhead, the task of integrating such cores 
is far easier than the task of integrating “hard” cores into an 
ASIC or SoC. When the required amount of programmable 
logic is small, this ease of use may be more important than the 
increased overhead. This paper presents two synthesizable “soft” 
programmable logic core architectures and describes their asso- 
ciated place and route issues. We compare the two architectures 
to each other, and to a “hard” programmable logic core. We also 
show how these cores can be made more efficient by creating a 
nonrectangular architecture, an option not usually available to 
“hard” core vendors. Finally, a proof-of-concept integrated circuit 
containing one of these cores is described. 


Index Terms—Field-programmable gate arrays, programmable 
logic, SoC design. 


I. INTRODUCTION 


RECENTLY, we have witnessed impressive improve-
ments in the achievable density of integrated circuits.
In order to maintain this rate of improvement, designers need 
new techniques to manage the increased complexity inherent 
in these large chips. One such emerging technique is the 
system-on-a-chip (SoC) design methodology. In this method- 
ology, pre-designed and pre-verified blocks, often called cores 
or intellectual property (IP), are obtained from internal sources 
or third-parties, and combined on a single chip. These cores 
may include embedded processors, memory blocks, interface 
blocks and components that handle application specific pro- 
cessing functions. Large productivity gains can be achieved 
using this approach. In fact, rather than implementing each of 
these components separately, the role of the SoC designer is to 
integrate them onto a chip to implement complex functions in
a relatively short amount of time. 

One major issue today in SoC design is the overall design cost 
in terms of engineering costs, the cost of IP blocks and the rising 
costs of masks in advanced technologies. For this reason, it is de- 
sirable to construct programmable SoCs to amortize the cost of 
a single design across many related applications. Furthermore, 
the cost of errors in the design can be significant. No matter 
how seamless the SoC design flow is made, and no matter how 
careful an SoC designer is, there will inevitably be some chips 
that have problems that are found after fabrication. This may be 
due to design errors not detected by simulation or it may be due 
to a change in design requirements. While this type of problem 
is not unique to chips designed using the SoC methodology, it 
lends itself to the use of an elegant solution to the problem: one 
or more programmable logic cores can be incorporated into the 
SoC. 

A programmable logic core (PLC) is a flexible logic fabric 
that can be customized to implement any digital circuit after 
fabrication. Before fabrication, the designer embeds a pro- 
grammable fabric, consisting of many uncommitted gates and 
programmable interconnects between the gates, onto the chip. 
After the fabrication, the designer can then program these gates 
and the connections between them to serve different applica- 
tions or implement design changes. These configurable logic 
blocks and connections have also been commonly referred 
to as embedded FPGAs (field programmable gate arrays), as 
opposed to stand-alone FPGAs that have been available for two 
decades. 

Several companies already provide programmable logic cores 
[1]-[4]. Yet, the use of these cores is still far from mainstream. 
There are a number of reasons for this: 


1) Tools for the design and integration of programmable fab- 
rics are not widely available as yet. This is somewhat of 
a chicken-and-egg problem: existing tools and flows will 
not be enhanced to support the seamless integration of 
programmable logic cores until this design technique be- 
comes mainstream, and the design technique will not be- 
come mainstream until the tools are enhanced to support 
programmable logic cores. However, as chip design costs 
escalate, the economics of chip design will be a strong 
driver for increased hardware programmability. 

2) Programmable logic cores come in relatively fixed for-
mats. That is, the integrated circuit designer cannot
modify the overall size of the fabric or the internal
structure of the programmable logic core. The integrated
circuit designer must choose a programmable logic core
that is closest to the desired size; this could lead to
wastage of chip area. This can be addressed by providing
tiles of programmable logic that can be snapped together
to form a design logic fabric of the desired size to mini-
mize the area penalty.

3) Embedded programmable logic is not as efficient as hard- 
wired logic in terms of area, power and speed. There are, 
however, special-purpose fabric generators emerging that 
can provide a better tradeoff between these specifications, 
depending on the target application. 

In spite of these barriers, we believe that the use of embedded 
programmable fabrics will continue to increase on both ASIC 
and SoC designs. There will be a need for large-grain, medium- 
grain and fine-grain fabrics to serve a variety of needs on the 
chip. Of particular interest in this paper is the use of fine-grain 
programmable fabrics. There are many cases where an inte- 
grated circuit designer would prefer to have many very small 
regions of programmable logic, rather than a single or handful 
of large programmable logic regions. As a simple example, con- 
sider a control logic block which coordinates the operation of 
the rest of the chip; it may be beneficial to map selected parts of 
this control logic to programmable logic, rather than the entire 
control logic block. 

In this paper, we describe a novel method for incorporating 
fine-grain programmable logic cores into an SoC. Rather than 
providing “hard” rectangular layouts, core vendors would 
provide “soft” descriptions of their programmable logic cores 
(PLC). Alternatively, the user could develop these cores 
themselves without much difficulty. These descriptions would 
typically be written at the register transfer level (RTL) in a hard- 
ware description language (HDL), such as VHDL or Verilog. 
We refer to this as a soft PLC. The integrated circuit designer 
could then incorporate the soft PLC description into the RTL 
description for the rest of the (nonprogrammable) chip, and 
then synthesize the entire chip using existing synthesis tools. 
The advantages and certain limitations of this approach are the 
subject of this paper. 

In [5], Phillips and Hauck describe the Totem architecture, 
which is a coarse-grained programmable logic fabric. Phillips 
and Hauck describe several ways of implementing their fabric, 
one of which is to use a soft description mapped to standard 
cells. Unlike our approach, however, they focus on large coarse- 
grained fabrics rather than the small fabrics that might be incor- 
porated into an SoC. Reference [6] also describes a standard-cell 
implementation of a programmable logic fabric, but again, it 
does not specifically target the SoC domain. 

This paper is organized as follows. First, the soft PLC tech-
nique is described in more detail in Section II. Sections III and
IV describe new architectures and place-and-route algorithms
for these cores. Since the soft cores are intended to be synthe-
sized using standard synthesis tools, it is unlikely that traditional
FPGA architectures, optimized for full-custom layout, will be
appropriate. We provide two novel architectures [7], [8] that are
designed specifically for these soft cores. Section V identifies
key parameters for our architectures, and seeks optimum values
for these parameters. Finally, Section VI describes our experi-
ences with a test chip that was fabricated using one of our
synthesizable programmable logic cores. Conclusions are provided
in Section VII.


II. SOFT PLC DESIGN FLOW


As described in the introduction, integrated circuit designers 
who wish to use a programmable logic core typically receive a 
“hard core” which contains the actual physical transistor layout 
information. The size and shape of the core is fixed; the only 
freedom the designer has is where to position the core on the 
chip and how to connect the I/O to the block. However, using 
our scheme, the designer receives the core in the form of a “soft
core.” A “soft core” is one in which the designer obtains an
RTL description of the behavior of the core, written in Verilog 
or VHDL. In this sense, it is similar to the definition of a soft 
IP core used in SoC designs [15]. The distinction is that, in a 
soft PLC, the user circuit to be implemented in the core is pro- 
grammed after fabrication. 

The value of this approach is derived from the tools needed to 
implement the fabric. Since the designer receives only an RTL 
description of the behavior of the core, synthesis tools must be 
used to map the behavior to gates and eventually to layout. These 
tools can be the same ones that are used in the standard ASIC 
flow. In fact, the primary advantage of the new method is that ex- 
isting ASIC tools can be used to implement the chip. No modifi- 
cations to the tools are required, and the flow follows a standard 
integrated circuit design flow. This will significantly reduce the 
design time of chips containing these cores. 

A second advantage is that this technique allows small 
blocks of programmable logic to be positioned very close to the 
fixed logic that connects to the programmable logic to improve 
routability and shorten wire lengths. The use of a “hard core” 
requires that all the programmable logic be grouped into a small 
number of relatively large blocks. A third advantage is that the 
new technique allows users to customize the programmable 
logic core to better support the target application. This is be- 
cause the description of the behavior of the programmable logic 
core is an RTL description that can be understood and edited by 
the user. Finally, it is easy to migrate the programmable block 
to new technologies; new programmable logic cores from the 
core vendors are not required for each technology node [15]. 

Of course, the main disadvantage of the proposed technique is 
that the area, power, and speed overhead will be significantly in- 
creased, compared to implementing programmable logic using 
a hard core. Thus, for large amounts of circuitry, this technique 
would not be suitable. It only makes sense if the amount of pro- 
grammable logic required is small. In Section V, we will quan- 
tify this tradeoff, but first we explore the issues of design flow 
and architecture suitable for such an approach.

The basic design flow employing soft PLCs is as follows: 

1) The integrated circuit designer partitions the design into 
functions that will be implemented using fixed logic and 
programmable logic, and describes the fixed functions 
using a hardware description language. At this stage, the 
designer must determine the size of the largest function 
that will be supported by the core; this can be done either 
by considering example configurations, or based on the 
experience of the designer. 


Fig. 1. Comparison of standard FPGA and soft PLC blocks. (a) Standard FPGA logic block. (b) Soft PLC logic block.


2) The designer obtains an RTL description of the behavior 
of a programmable logic core. This behavior is also spec- 
ified in the same hardware description language. 

3) The designer merges the behavioral description of the 
fixed part of the integrated circuit (from step 1) and the be- 
havioral description of the programmable logic core (from 
step 2), creating a behavioral description of the block. 

4) Standard ASIC synthesis, place, and route tools are then 
used to implement the soft PLC behavioral description 
from step 3. In this way, both the programmable logic core 
and fixed logic are implemented simultaneously. 

5) The integrated circuit is fabricated. 

6) The user configures the programmable logic core for the 
target application. 

Note that in Step 4 of the design flow, there is an important dif- 
ference in the implementation of the programmable logic for a 
standard FPGA fabric and a soft PLC fabric, as illustrated in 
Fig. 1. Consider the simplified view of a 3-input lookup table 
(3-LUT) used in an FPGA. The standard fabric uses SRAM 
cells to store configuration bits and pass transistors to implement 
the 3-LUT shown in Fig. 1(a). In the soft PLC case shown in
Fig. 1(b), a standard-cell library is used to implement the same 
3-LUT. In fact, all desired functions of the soft PLC are con- 
structed from NANDs, NORs, inverters, flip-flops (FF) and multi- 
plexers from the standard cell library. The same holds true for 
the programmable interconnect in the FPGA and soft PLC. 
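As a rough illustration of the soft PLC case in Fig. 1(b), the following Python sketch models a 3-LUT as eight configuration bits selected by a tree of 2:1 multiplexers. The names `mux2` and `lut3` and the ordering of the configuration bits are assumptions made for this example; they are not taken from the cores described in this paper.

```python
def mux2(sel, a, b):
    """2:1 multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def lut3(config_bits, x2, x1, x0):
    """Evaluate a 3-input LUT built from a tree of 2:1 multiplexers.

    config_bits holds eight 0/1 values (standing in for the configuration
    flip-flops). Entry i is the output for the input combination whose
    binary value (x2 x1 x0) equals i; this ordering is an assumption.
    """
    if len(config_bits) != 8:
        raise ValueError("a 3-LUT needs exactly 8 configuration bits")
    # First level: x0 selects within each adjacent pair of bits.
    level1 = [mux2(x0, config_bits[i], config_bits[i + 1]) for i in range(0, 8, 2)]
    # Second level: x1 selects between the first-level results.
    level2 = [mux2(x1, level1[i], level1[i + 1]) for i in range(0, 4, 2)]
    # Final level: x2 selects the LUT output.
    return mux2(x2, level2[0], level2[1])


# Example configuration: 3-input XOR (odd parity of the inputs).
xor3_bits = [bin(i).count("1") % 2 for i in range(8)]
```

Reprogramming the LUT in this model only changes `config_bits`; the multiplexer tree itself, which is what the standard-cell library implements, stays fixed.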

To emphasize this point further, consider how the complete 
fabric would be constructed in the two cases. For the soft PLC, 
the final logic schematic and layout are determined by the logic
synthesis tool, technology mapping algorithms, and the place- 
and-route tool. In the case of a hard fabric, a custom layout ap- 
proach is used to create a “tile” for the FPGA. Then the FPGA 
fabric is assembled by replicating the tiles horizontally and ver- 
tically. Clearly, the standard FPGA approach is more area effi- 
cient but the soft PLC has the advantage of ease of use. 


III. PROPOSED ARCHITECTURES FOR SOFT PLC


Now that the main features of the approach have been out- 
lined, we describe two alternative architectures for a soft pro- 
grammable logic core. The first proposed architecture is very 
similar to a standard FPGA architecture with some adjustments.
However, this approach still has a significant area penalty. Since
the desired fabric is intended for fine-grain programmability, 
one would expect the architecture to be different from standard 
FPGAs. As will be shown in Section V, we can reduce the area 
of our core by removing some degree of flexibility; the second 
architecture contains fewer programmable switches and hence 
is more area-efficient, yet contains enough flexibility to imple- 
ment small circuits. 


A. Architecture 1: Directional Architecture 


The most straightforward way to implement a synthesizable 
programmable logic core is to describe the behavior of a stan- 
dard FPGA at the RTL level using a hardware description lan- 
guage. The standard FPGA blocks are fairly complex and allow 
for both combinational and sequential elements. It is important 
to carefully consider the target applications and the required 
complexity of the programmable blocks. In doing so, we can 
make the following observations. 

Observation 1: Synthesizable programmable logic cores only 
make sense for very small amounts of programmable logic. An 
envisaged application would be the next state logic in a state 
machine. In that case, only combinational functions are needed. 

Observation 2: Many CAD tools (the tools that will be used 
to synthesize the programmable logic core, perform timing ver- 
ification, etc.) have problems with combinational loops. 

These observations motivate us to modify a standard FPGA 
architecture. First consider Observation 1. Since we are tar- 
geting small amounts of logic, we began with an architecture 
that will only implement combinational logic, allowing us to re- 
move all flip-flops needed for sequential logic functions. Flip- 
flops can be added at the inputs and outputs of the programmable 
logic core by the IC designer if desired. Removing flip-flops re- 
duces area and simplifies timing analysis. Of course, the flip- 
flops associated with the programming cells are still required 
for both logic and interconnect blocks. 

Observation 2 leads to a more interesting problem since an 
un-programmed PLC contains many combinational loops. Al- 
though these loops are ultimately false paths, they can still pose 
problems for CAD tools and during the actual configuration bit 
programming process. Thus, we have created a “directional” ar- 
chitecture in which the flow between logic blocks can only occur 
from left to right. Since our architecture only implements com-
binational circuits, this will not allow any loops in the logic; any
feedback loops that are required would be implemented outside 
of the core. 

Based on these observations, we have created the architecture 
shown in Fig. 2(a). Each switch block is a standard switch block, 
with all right-to-left connections removed, as shown in Fig. 2(b). 
A simplified view of the 3-LUT is shown again in Fig. 2(c). The 
choice of a 3-LUT (as opposed to a 4-LUT or 5-LUT) was based 
on the observation that the ratio of logic area divided by routing 
area is larger in a synthesized core than a hand-optimized core; 
thus, we found that a smaller LUT is more efficient. 


B. Architecture 2: Gradual Architecture


We can consider more efficient architectures by making the 
following additional observations. 

Observation 3: Since we are implementing such small cir- 
cuits, we should consider removing some flexibility to improve 
area efficiency. 

Observation 4: Since the core will be hardwired into a fixed- 
function chip, we will require additional flexibility on the inputs 
and outputs. 

Observation 5: Unlike a hard FPGA layout, it is not critical 
that each tile be identical. In a hard layout, FPGA vendors do 
not wish to lay out multiple tiles; in our case, the fabric is syn-
thesized and laid out automatically by CAD tools. Therefore, we
have some freedom in defining the structure of the underlying 
fabric. 

These observations lead to the architecture in Fig. 3, which we 
call the “Gradual Architecture.” Like the Directional Architec- 
ture, signals in the Gradual Architecture flow from left to right, 
and the logic resources consist only of 3-LUTs. However, in this
architecture, the number of horizontal routing channels gradu- 
ally increases from left to right, since more outputs are gener- 
ated in each level that can be used as inputs by the downstream 
LUTs. The vertical tracks are only accessible through LUT out- 
puts (each vertical track can be driven by one LUT), and can be 
connected to horizontal tracks using a dedicated multiplexer at 
each grid point. Note that, except for this multiplexer, no switch 
block is required in this architecture. The extension of this archi- 
tecture to any number of rows and columns is straightforward. 

The routing multiplexers in the first column are different from 
the others. We have performed experiments showing that pri- 
mary inputs are frequently required in many different columns. 
Thus, we have included several routing multiplexers in each row 
(we will vary the number of these multiplexers in Section V). 
For each row there are one or more output select multiplexers 
to choose a primary output of the circuit. The output multi-
plexers choose between the outputs of all LUTs located in the
last column and any horizontal line located above or below that 
specific row. The exception to this is that only one routing multi- 
plexer per row from the first column passes a signal to the output 
select multiplexers. 


IV. PLACEMENT AND ROUTING ISSUES 


Once a programmable logic core has been embedded into a 
chip design, and the chip has been manufactured, the user-de- 
fined circuit can be implemented on the core. A CAD tool is 
usually employed to determine the programming bits needed to 
implement the user-defined circuit. Since our architectures con- 
tain novel routing structures, some modifications must be made 
to standard FPGA placement and routing algorithms. In this sec- 
tion, we describe these modifications for the two architectures 
described in Section III.

It is important to note that we are not referring to the stan- 
dard cell placement and routing tools needed to implement the 
programmable fabric itself onto the chip. Rather, the algorithms 
in this section are used to implement a user circuit on the pro- 
grammable fabric after the chip has been fabricated. For ex- 
ample, the VPR tool [9] determines where to place the logic 
functions and how to form the connections between the logic 
functions on a given FPGA fabric. At the end of the process, the 
programming bits are generated for the fabric. These bits must 
be shifted into the fabricated chip to implement a user-defined 
circuit. The process is repeated if a different user circuit is to be 
implemented. 


A. Placement Algorithms 


Fig. 4. Good placements on the Gradual Architecture.

1) Directional Architecture: The placement algorithm for
the Directional Architecture described in Section III is based on
the original simulated annealing placement algorithm of VPR
[9]. The only change was to impose a restriction on the placer
which stipulates that input sources for all blocks must originate
from the left of that block; otherwise, the placement is viewed
as illegal. During the annealing, we never allow a move that
would result in an illegal placement.
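The left-to-right restriction can be sketched as a legality check applied to each candidate move. In the following Python fragment, the data layout (a dictionary from block name to a `(column, row)` pair and a list of `(source, sink)` connections) and the names `is_legal_placement` and `try_swap` are hypothetical; VPR's internal data structures are not reproduced here. The sketch also assumes that "from the left" means a strictly smaller column index.

```python
def is_legal_placement(placement, connections):
    """Return True if every connection runs strictly left to right.

    placement:   dict mapping block name -> (column, row)
    connections: iterable of (source_block, sink_block) pairs
    """
    for source, sink in connections:
        src_col, _ = placement[source]
        dst_col, _ = placement[sink]
        if src_col >= dst_col:  # source is not to the left of the sink
            return False
    return True


def try_swap(placement, connections, block_a, block_b):
    """Swap two blocks, keeping the swap only if the result is legal.

    Returns (placement_to_use, accepted). An annealer would apply its
    usual cost-based acceptance test only to moves that pass this check.
    """
    trial = dict(placement)
    trial[block_a], trial[block_b] = placement[block_b], placement[block_a]
    if is_legal_placement(trial, connections):
        return trial, True
    return placement, False
```

Rejecting illegal moves outright, rather than penalizing them in the cost function, matches the statement above that such moves are never allowed during annealing.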

The cost function used in the VPR placement algorithm de- 
pends on the delay of potential connections as well as on the 
Manhattan distance between pins. In a synthesized core, the 
delay between pins depends on where the individual cells that 
make up the core are positioned; it may be that adjacent blocks 
in the conceptual representation of Fig. 2(a) may be positioned 
far apart in the actual layout. However, for convenience, we base 
our placement cost function on the distances and delays in the 
conceptual representation. Improvements can be made by sup- 
plying the VPR tool with the extracted delay and distance infor- 
mation from the actual layout of the synthesized core. Instead 
of relying on the conceptual representation, we can then use the 
“physical” representation to obtain better delay estimates during 
placement and routing. 

2) Gradual Architecture: In the Gradual Architecture, the 
routing fabric is less flexible than a standard FPGA. Poor place- 
ments can easily lead to un-routable implementations. We use 
a simulated annealing based algorithm with a unique cost func- 
tion for this architecture, as described below. 

Fig. 4 shows two examples of “good” placements on a sim- 
plified view of the Gradual architecture. In Fig. 4(a), a source 
logic block drives two sink logic blocks in the adjacent column. 
The corresponding net can be routed without any conflicts since 
no shared resources are required. Note that the input multiplexer 
used to feed each input pin of a logic block is not a shared re- 
source; there is one such multiplexer per input pin. Any number 
of sinks in the column immediately adjacent to the source can 
be connected in this way as shown in Fig. 4(a) for the case of 
two sinks. 

On the other hand, nets that drive logic blocks that are not in 
the immediately adjacent column must make use of routing mul- 
tiplexers; these are shared resources. In the example of Fig. 4(b), 
a net drives four sinks but only needs one routing multiplexer, 
since the sinks are all in two vertically adjacent rows (meaning 
that the track between the two rows can be used to drive all 
sinks). If another net also required the shaded routing multi- 
plexer, a conflict would arise when we tried to route the two 
nets. Since these routing multiplexers are shared resources, we 
wish to minimize the number of routing multiplexers used by
each net. Therefore, we should penalize placements that gen-
erate many such potential conflicts for the router. Again note
that the input multiplexers used to feed the input pins of each
logic block are not shared resources, and thus should not play a
role in the cost of a given placement.

Fig. 5. Example placements on the Gradual Architecture.

Based on these considerations, a new cost function was devel- 
oped for the placement algorithm that directly relates to overuse 
of routing multiplexers. Before presenting the cost function it- 
self, we first describe certain factors that will be used in the func- 
tion. Consider the nets in Fig. 5(a) that would connect the indi- 
cated source and sink. In this case, we consider it equally likely 
that the final routed net will use one of the two indicated routing 
multiplexers; therefore, we define the demand for each of the 
two multiplexers as 0.5 relative to the indicated source and sink.
In Fig. 5(b), it is almost certain that the routed net will use the 
indicated routing multiplexer, since that single multiplexer can 
be used to feed both sinks, so the demand for that net is close to 
1. Note that a valid route could be found that does not use this 
multiplexer; however, such a route would require two routing 
multiplexers. During placement, we assume that this will not 
happen, and thus, set the demand term for all other routing mul- 
tiplexers for this net to 0. Of course, this does not mean the router 
is constrained to use this routing multiplexer. It is simply an as- 
sumption made to compute the cost function during placement. 

Fig. 6 shows a net that drives four vertically adjacent rows. In 
this case, we assume that the two indicated routing multiplexers 
are used with probability 1 during placement. Experimentally,
we have determined that this leads to better results than if we as- 
sign all five routing multiplexers in that column the same value 
(which would be about 1/2). Again, note that the router is not 
constrained to actually use the indicated multiplexers. 
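The three demand cases discussed for Figs. 5 and 6 can be captured by one small routine. The sketch below assumes that each net is described by a list of equally likely route alternatives, each alternative being the set of routing multiplexers it would use; this representation, the function name `net_demand`, and the grid coordinates in the examples are all hypothetical. How the alternatives are derived from the geometry of a placement is not shown.

```python
from collections import defaultdict


def net_demand(alternatives):
    """Estimate one net's demand for each routing multiplexer.

    alternatives: list of sets; each set holds the (column, row) locations
    of the routing multiplexers one candidate route would use. All
    alternatives are treated as equally likely (an assumption).
    Returns a dict mapping (column, row) -> demand in [0, 1]. Multiplexers
    that appear in no alternative are absent, i.e. have demand 0.
    """
    if not alternatives:
        return {}
    demand = defaultdict(float)
    share = 1.0 / len(alternatives)
    for muxes in alternatives:
        for location in muxes:
            demand[location] += share
    return dict(demand)


# Fig. 5(a): either of two multiplexers may be used, so 0.5 each.
two_choices = net_demand([{(2, 1)}, {(2, 2)}])
# Fig. 5(b): a single multiplexer feeds both sinks, so its demand is 1.
single = net_demand([{(2, 1)}])
# Fig. 6: both indicated multiplexers are assumed to be used.
four_rows = net_demand([{(2, 1), (2, 3)}])
```

The text describes the demand in the Fig. 5(b) case as "close to 1"; this sketch simplifies that to exactly 1.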

To derive the cost function, we start by defining an occupancy 
function, Occ(), of a routing multiplexer as an estimate of how 
many nets would like to use that routing multiplexer. We can 
write this as the sum of the estimated demand for a given mul- 
tiplexer by each net: 


$$\mathrm{Occ}(c, r) = \sum_{n \in \mathrm{Nets}} \mathrm{demand}(c, r, n)$$

where demand(c, r, n) is the estimated demand for the routing
multiplexer at column and row (c, r) by net n. As already de-
scribed, the demand is a number that lies in the range between 0
and 1; 0 implies that there is little chance that the router will use
this multiplexer to route net n, while 1 means that the router will,
with high probability, use this multiplexer when routing net n.

Fig. 6. Example placements on the Gradual Architecture: Sinks in many
adjacent rows.

Next we define the capacity function, Cap(), of a routing
multiplexer as the number of output lines available from a given
set of input lines. It is an estimate of the ability to satisfy the
routing demand at a given location. Typically, the capacity of 
all routing multiplexers is set to 1 since each one has a single 
output. However, for those muxes in the first column, the ca- 
pacity is equal to the number of horizontal lines that can be 
driven from primary inputs. Referring back to Fig. 3, the ca- 
pacity function would be 3 since three muxes drive 3 adjacent 
horizontal lines from the same set of primary inputs at each 
location. 

With these definitions in place, the cost of a given placement 
on a C-column, R-row core is given by 


$$\mathrm{Cost} = \sum_{r=0}^{R} \sum_{c=0}^{C} \max\left[0,\ \mathrm{Occ}(c, r) - \mathrm{Cap}(c, r) + \gamma\right]$$

where Occ(c, r) is the occupancy demand of a routing multi-
plexer at location (c, r), and Cap(c, r) is the output capacity of
multiplexers at location (c, r). We take the difference between
Occ() and Cap() to incorporate the fact that one or more out-
puts are available at each location. If the difference is negative,
we set the cost of that routing mux to 0 using the max function.
The γ term is a small bias value (set to 0.2 for our experiments).
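A minimal Python sketch of this cost computation follows. It assumes occupancy and capacity are stored in dictionaries keyed by `(column, row)`; the function names and the example values are illustrative only and do not come from the modified VPR placer.

```python
def occupancy_from_nets(per_net_demand):
    """Occ(c, r): add up every net's demand for each routing multiplexer.

    per_net_demand: iterable of dicts, one per net, each mapping
    (column, row) -> demand in [0, 1].
    """
    occ = {}
    for demand in per_net_demand:
        for location, value in demand.items():
            occ[location] = occ.get(location, 0.0) + value
    return occ


def placement_cost(occupancy, capacity, gamma=0.2):
    """Sum max(0, Occ - Cap + gamma) over all multiplexer locations.

    occupancy: dict mapping (column, row) -> summed demand
    capacity:  dict mapping (column, row) -> number of available outputs
    gamma:     small bias term (0.2 in the experiments described above)
    Locations missing from `occupancy` are treated as having zero demand.
    """
    cost = 0.0
    for location, cap in capacity.items():
        occ = occupancy.get(location, 0.0)
        cost += max(0.0, occ - cap + gamma)
    return cost


# Hypothetical example: a first-column location with capacity 3 and two
# ordinary routing multiplexers with capacity 1.
capacity = {(0, 0): 3, (1, 0): 1, (1, 1): 1}
occ = occupancy_from_nets([{(1, 0): 1.0}, {(1, 0): 0.5, (1, 1): 0.5}])
cost = placement_cost(occ, capacity)  # only location (1, 0) contributes
```

One consequence of the bias term in this formulation is that a multiplexer starts contributing to the cost once its estimated occupancy exceeds Cap − γ, that is, slightly before it is nominally full.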


B. Routing Algorithms 


The negotiated-congestion based routing algorithm from 
VPR [9] was used without modification for both architectures. 
For the Gradual Architecture, the routing task is very easy 
since there are only a few potential routes for each net. For the 
Directional Architecture, there are many potential routes so 
the routing is more complex. The use of the advanced router 
within VPR gave us the ability to evaluate different architectures
and placement schemes during our architectural investigation. 
TABLE I
DIRECTIONAL AND GRADUAL ARCHITECTURE RESULTS

              |        Directional Architecture        |          Gradual Architecture
Benchmark     | FPGA Core    | Tracks per  | Cell Area   | FPGA Core    | Input Muxes | Cell Area
Circuit       | Size         | Channel     | (µm²)       | Size         | per Row     | (µm²)
cc            | 9x9          | 4           | 300 460     | 8x8          | 3           | 263 101
cm138a        | 5x5          | 3           | 80 868      | 5x5          | [illegible] | [illegible]
cm150a        | 9x9          | 4           | 300 460     | [illegible]  | [illegible] | 263 101
cm151a        | [illegible]  | 3           | 80 868      | 4x4          | [illegible] | 43 932
cm152a        | 4x4          | 3           | 53 004      | 4x4          | [illegible] | 43 932
cm162a        | 5x5          | 4           | 96 854      | 5x5          | 2           | 89 614
cm163a        | 6x6          | 5           | 174 589     | 5x5          | 2           | 89 614
cm42a         | 5x5          | 4           | 96 854      | 5x5          | [illegible] | 89 614
cm82a         | 4x4          | 3           | 53 004      | 2x2          | [illegible] | [illegible]
[illegible]   | [illegible]  | 4           | 137 518     | 6x6          | 2           | [illegible]
cmb           | 7x7          | 3           | 154 407     | 7x7          | 2           | 184 590
comp          | 12x12        | [illegible] | 528 332     | 11x11        | [illegible] | 542 489
con1          | 4x4          | 3           | 53 004      | 4x4          | [illegible] | 43 932
count         | [illegible]  | 5           | 667 344     | 10x10        | 4           | 487 588
cu            | 8x8          | 3           | 199 702     | 8x8          | 2           | 244 676
5xp1          | [illegible]  | 5           | 562 305     | 11x11        | 2           | 542 489
i1            | [illegible]  | 3           | 199 702     | 7x7          | 2           | 184 590
[illegible]   | 10x10        | 5           | 466 121     | 10x10        | 2           | 424 445
unreg         | 10x10        | 4           | 368 620     | 9x9          | 4           | 388 074
Average       |              |             | 240 737     |              |             | 218 009
Geo. Avg.     |              |             | [illegible] |              |             | 141 954





V. EXPERIMENTAL RESULTS 


In this section, we experimentally compare the two architec- 
tures described in Section III. We used 19 small combinational 
MCNC benchmark circuits [14]. We selected small circuits 
since these are the type of circuits we expect to be used with 
our architecture; large circuits would likely be implemented 
using hard programmable logic cores. For each circuit, we 
initially found the minimum-size square core on which the 
circuit can be placed and routed. We then created a VHDL 
description of each core, and synthesized it using Synopsys 
Design Compiler™ and a standard 0.18-µm CMOS library.
The cell area reported by the Synopsys tool was used as a basis
for comparison in Table I.


A. Directional Architecture Versus Gradual Architecture 


The first four columns of Table I show the results for the Di- 
rectional Architecture. For each benchmark circuit, we varied 
both the core size and the number of tracks in each channel, and 
chose the configuration which resulted in the minimum area; the 
chosen size and channel width are shown in columns two and 
three of the table. For each configuration, we then synthesized 
the architecture using Synopsys; the fourth column in the table 
shows the cell area required to implement the core. 

The final three columns show the results for the Gradual Ar- 
chitecture. In this case, we varied both the core size and the 
number of input multiplexers per row, and chose the configu- 
ration which resulted in the lowest area. These numbers are re- 
ported in columns five and six of the table, and the synthesized 
cell area from Synopsys is shown in the final column. From the 
last row of the table, the geometric average of the area required 
to implement the circuits on the Gradual Architecture is 18.9% 
less than that required to implement the same circuits using the 
Directional Architecture. 
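The comparison above rests on a geometric average and a relative difference. The following Python sketch shows both calculations; the helper names are hypothetical, and the list holds the Directional Architecture cell areas as transcribed from Table I.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed via logarithms to avoid overflow."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def percent_reduction(reference, improved):
    """How much smaller `improved` is than `reference`, in percent."""
    return 100.0 * (reference - improved) / reference


# Directional Architecture cell areas (um^2), transcribed from Table I.
directional_areas = [
    300460, 80868, 300460, 80868, 53004, 96854, 174589, 96854, 53004,
    137518, 154407, 528332, 53004, 667344, 199702, 562305, 199702,
    466121, 368620,
]

directional_geo_avg = geometric_mean(directional_areas)
# 141954 um^2 is the Gradual Architecture geometric average from Table I.
reduction = percent_reduction(directional_geo_avg, 141954)
```

The value of `reduction` can be compared with the 18.9% figure quoted above. A geometric average is less dominated by the few largest circuits than the arithmetic average, which is presumably why both are reported.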


B. Soft Versus Hard Programmable Logic Cores 


As mentioned in Section II, the primary disadvantage of using
a “soft” programmable logic core is the reduced density and
speed and the increased power consumption. In this subsection,
we estimate the area penalty of a soft core compared to a hard core.

The most accurate way to compare the area required by soft 
and hard programmable logic cores would be to lay out (by 
hand) a hard core, and compare its area with the numbers in 
Table I. This is a time-consuming task. Instead, we estimated 
the size of a hard core using a detailed transistor-count model, 
following the methodology described in [9]. We focus on a 
4 × 4 Gradual Architecture with three input multiplexers per
row. By estimating the number of minimum transistor equiva-
lents (MTEs) required to implement the circuit, and converting
this to area in our 0.18-µm technology, we estimate the layout
area of such a core to be 12 868 µm². A soft core was generated
using these same parameters, and the size (after synthesis using
Synopsys and physical design using Cadence) was 81 092 µm².
Thus, the synthesized core requires approximately 6.4 times
more area than the hard core.

This number is significant. Clearly, for large programmable 
logic cores, our approach would not be suitable. However, if 
only small amounts of programmable logic are required, this 
density penalty may be acceptable. In addition, the use of a hard 
core will usually require the selection of a core from a library. 
Since it is unlikely that a library would contain all sizes and 
shapes of cores, in most cases, a designer would end up choosing 
a larger core than is required. Using a soft core, the designer can 
create a core of any size. Even if a core of the appropriate size 
was created, the difficulty inherent in embedding hard cores may 
make the use of hard cores less attractive than our soft approach. 

We have also compared our sizes to commercial FPGA lay-
outs using publicly available information. These comparisons
yield little insight, however, since the commercial devices con-
tain far more tracks per channel, and contain additional elements
such as flip-flops in the logic blocks.

TABLE II
SENSITIVITY OF RESULTS

I/O Connections                           | Grad vs. Dir
Default I/O connections                   | 18.9%
Half as many I/O connections              | 9.67%
Twice as many I/O connections             | 2.33%
Margin                                    | 9.23%
Conclusion                                | Sensitive

Simulated Annealing Algorithm             | Grad vs. Dir
Percent difference, baseline algorithm    | 18.9%
Percent difference, fast algorithm        | 15.5%
Margin                                    | 3.4%
Conclusion                                | Slightly sensitive


C. Sensitivity of Results 


As described in [11], it is critical to analyze results for their 
sensitivity to experimental assumptions. Table II shows two 
of our sensitivity results for the data in Table I. The first part 
of the table shows how the conclusions change if we alter the 
number of input/output connections per grid. In the experi- 
ments in Section V-A, it was assumed that an n x n Directional 
Architecture has 2n input/output connections along each of the 
four edges of the core, and that an n x n Gradual Architecture 
has 4n input/output connections along the left and right edges 
of the core. We attempted to use two other input/output ratios, 
and gathered the results in Table II. Although the Gradual Ar- 
chitecture always produced higher density than the Directional 
architecture, the margin by which the Gradual was better varied 
(we do not have enough data to conclude that this is a result 
of anything other than experimental “noise”). According to the
methodology in [11], we classify this experiment as sensitive to 
the input/output ratio, even though the conclusion that Gradual 
is better than Directional was the same in all cases. 

The second part of the table shows how a less aggressive 
placement schedule (fewer moves per temperature and larger 
temperature drops during the annealing) and routing schedule 
(fewer routing attempts) affects the conclusions. In this case, the 
margin was smaller, meaning the experiment was only slightly 
sensitive to the choice of algorithm. 


D. Nonrectangular Fabric 


The grid of logic blocks in standard FPGAs is usually square 
or rectangular. From [12], however, logic circuits often have a 
“triangular” shape as shown in Fig. 7(a). In standard FPGAs, 
this does not present a problem, since the routing resources are 
flexible enough that signals can be routed left, right, up, or down, 
as shown in Fig. 7(b). This means that in a standard FPGA, the 
physical implementation of a circuit need not match the fanout 
shape of the circuit. In the architectures described in this paper, 
however, the signal flow is restricted from left to right. As shown 
in Fig. 7(c), this can lead to unused logic blocks if the circuit 
does not have a naturally square shape. 

We can alleviate this problem somewhat by creating a pro-
grammable logic core that is not square. We have observed that
in many implementations, several logic blocks in the rightmost
columns remain unused. We can take advantage of this by
removing logic blocks from the last few columns, as indicated
with shading in Fig. 7(c). We quantify the number of logic
blocks removed using the parameter c, where c is defined as
the proportion of the logic blocks in the top row that have been
removed. In Fig. 7(c), c is 2/3. In all cases, we remove blocks
in a “triangular” fashion; if we remove m blocks from column
i, we remove m − 1 blocks from column i − 1. A value of 0 for
c indicates a rectangular core; a value of 1 indicates a triangular
core. Note that a nonzero value of c does not imply a nonrect-
angular final layout. The diagram in Fig. 7(c) is a conceptual
representation; the core will be synthesized into gates, and the
gates will be placed into rows of standard cells regardless of
the shape of the conceptual representation. Intuitively, as c is
increased, the area of the implementation will go down. If c is
increased too much, however, the area will rise, since a larger
virtual grid will be needed. This effect can be seen in Fig. 8.
Fig. 8(a) shows how the implementation area depends on c for
each circuit implemented on the Gradual Architecture (each
line represents a different circuit). Because we were unable
to synthesize large triangular cores using our synthesis tools,
results are only shown for 11 of the 19 benchmark circuits. The
geometric average over these 11 circuits is shown in Fig. 8(b).
Although each individual circuit in Fig. 8(a) exhibits its own
characteristics, the results in Fig. 8(b) indicate that the overall
gain obtained using a nonzero value of c is relatively small.
From Fig. 8(a), the “breakpoint” (the point at which a larger grid
is needed) is not the same for each circuit. Thus, the average re-
sults show that only a modest improvement can be achieved.
Overall, the value of c that gave the lowest area was 0.6, which
resulted in an 11.1% lower area than a square core, averaged
over all circuits.

Fig. 7. Implementing a circuit on a triangular core.
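The triangular removal rule can be illustrated with a short Python sketch. It assumes that c multiplied by the number of columns is rounded to the nearest integer, and that blocks are removed starting from the rightmost column; the function names and the 3 × 3 example grid are hypothetical.

```python
def removed_per_column(n_cols, c):
    """Number of logic blocks removed from each column, left to right.

    n_cols: number of columns in the conceptual grid
    c:      proportion of the top row that is removed (0 <= c <= 1)
    The rightmost column loses the most blocks, and each column to its
    left loses one fewer, following the triangular rule in the text.
    Rounding c * n_cols to the nearest integer is an assumption.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie between 0 and 1")
    k = round(c * n_cols)  # blocks removed from the top row
    removed = []
    for col in range(n_cols):
        from_right = n_cols - 1 - col  # 0 for the rightmost column
        removed.append(max(0, k - from_right))
    return removed


def remaining_blocks(n_rows, n_cols, c):
    """Logic blocks left in an n_rows x n_cols grid after removal."""
    return sum(max(0, n_rows - r) for r in removed_per_column(n_cols, c))


# A 3 x 3 grid with c = 2/3: the two rightmost columns lose 1 and 2 blocks.
example = removed_per_column(3, 2 / 3)
```

The count from `remaining_blocks` only describes the conceptual grid. As noted above, the synthesized layout is still placed into ordinary rows of standard cells, so the area saving need not be proportional to the number of blocks removed.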


VI. PROOF-OF-CONCEPT IMPLEMENTATION 


To investigate the implementation issues of our synthesiz- 
able embedded core approach, we have chosen a module derived 
from a chip testing application. This module acts as a bridge be- 
tween a test access mechanism (TAM) circuit [13] and an IP core 
under test. In the research work described in [13], the TAM is ac- 
tually a communication network that transfers test data to/from 
internal IP blocks on the chip in the form of packets. The module 
we selected allows the TAM and the IP core to run at different 
frequencies, resulting in higher overall TAM throughput. A chip
designed with this type of network TAM would contain one of
these selected modules for each IP core on the chip.

Fig. 9. Schematic of proof-of-concept module. (a) TAM-IP interface module (nonprogrammable). (b) TAM-IP interface module (programmable).


A. Reference Version 

Fig. 9(a) shows a block diagram of the module. The module 
consists of a buffer memory, a packet assembly/disassembly 
block, and two state machines. Packets received from the TAM 
circuit are optionally buffered before being converted to a form 
usable by an IP core under test. A key component in the module 
is the Packet Assembly/Disassembly block which controls the 
assembly and disassembly of test packets based on a given 
packet format. The packet format was subject to change from 
time to time during the course of the research described in [13],
which required a re-design of this block.


B. Programmable Version 


When packet formats are modified to adjust header, data and 
address information, the control circuitry must also be modi- 
fied. Noting this fact, we decided that the next-state logic would 
benefit from programmability. This would allow the user to 
modify some packet processing and control operations simply 
by re-programming the block. If the next-state logic of the
state machine is made programmable, as shown in Fig. 9(b), new
schemes can be implemented after fabrication of the integrated
circuit. Although a hard programmable logic core could also
be used here, this application is better suited to the soft PLC
approach due to its fine-grain nature.

Fig. 8. Area as a function of c for Gradual Architecture. (a) One trace per benchmark circuit. (b) Geometric average over benchmark circuits.


C. Implementation Issues 


We designed two versions of this module: 1) the reference 
version with no configurability, and 2) the programmable ver- 
sion, in which the assembly/disassembly control is removed 
and replaced with a soft programmable logic fabric. The fabric 
uses the Gradual Architecture as it was found to be more effi- 
cient than the Directional Architecture. When adding the pro- 
grammable component to our module, a number of other inter- 
esting issues arose. This section summarizes these issues. 

1) Programmable Logic Core Size: The first issue was how 
much programmable logic is needed to replace the fixed next 
state logic. Without knowing the actual logic function that will 
eventually be implemented in the core, it is difficult to estimate 
the amount of programmable logic required. However, in this 
case, we have domain knowledge regarding the types of functions
that will be implemented, and we can use this knowledge
to make reasonable decisions. We designed two user logic functions
that would be implemented in the core, and determined
the size of the core that would be required to implement each
function using VPR [9]. For our circuit, we found that a core
consisting of 49 LUTs (i.e., a 7 × 7 array of 3-LUTs) would be
sufficient for both potential logic functions; however, to allow
some safety margin in anticipation of larger functions, a core
of 64 LUTs (8 × 8 array) was used.

Fig. 10. Programming clock tree routing complexity. (a) Portion of Gradual Architecture. (b) Physical design of programmable module.
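The sizing step above can be sketched in a few lines. This is only an illustration, assuming the core is a square array and that the safety margin is expressed as one extra row and column; the function names are hypothetical and do not come from VPR or from the original design flow.

```python
import math


def square_array_side(num_luts: int) -> int:
    """Smallest n such that an n x n array holds at least num_luts LUTs."""
    side = math.isqrt(num_luts)
    return side if side * side >= num_luts else side + 1


def core_size_with_margin(required_luts: int, extra_rows_cols: int = 1) -> tuple[int, int]:
    """Return (side, total LUTs) after growing the square array by a margin.

    The one-row/one-column margin is an assumption made for this sketch.
    """
    side = square_array_side(required_luts) + extra_rows_cols
    return side, side * side


# 49 LUTs fit a 7 x 7 array; one extra row and column gives the 8 x 8 core.
side, luts = core_size_with_margin(49)
print(f"{side} x {side} array, {luts} LUTs")
```

With the 49-LUT requirement quoted above, the sketch reproduces the 8 × 8 (64-LUT) core that was used.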

2) Connections Between the Core and the Fixed Logic: A 
second issue is how the programmable logic core is connected to 
the rest of the module. Although the core itself is programmable, 
specific inputs and outputs must be connected to the core in ad- 
vance. This will dictate which functions are possible to imple- 
ment in the core. Again, we have domain knowledge to assist 
us with this decision. We can select which inputs are connected 
to the core and which outputs will be made available from the 
core. In our design, the two user logic functions required 9 in- 
puts and 10 inputs, respectively, and required 11 outputs and 12 
outputs, respectively. We afforded ourselves some flexibility by 
hardwiring a selected set of 10 inputs and 13 outputs to our core. 
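A simple pin-count check illustrates the choice of hardwired connections. The sketch below only compares counts taken from the text; it does not model which specific signals are wired to the core, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicFunction:
    name: str
    inputs: int
    outputs: int


def fits(core_inputs: int, core_outputs: int, fn: LogicFunction) -> bool:
    """True if the function needs no more pins than the core exposes."""
    return fn.inputs <= core_inputs and fn.outputs <= core_outputs


CORE_INPUTS, CORE_OUTPUTS = 10, 13  # hardwired connections quoted in the text
functions = [
    LogicFunction("first user function", inputs=9, outputs=11),
    LogicFunction("second user function", inputs=10, outputs=12),
]

for fn in functions:
    spare_in = CORE_INPUTS - fn.inputs
    spare_out = CORE_OUTPUTS - fn.outputs
    print(fn.name, fits(CORE_INPUTS, CORE_OUTPUTS, fn), spare_in, spare_out)
```

Both functions fit, with at most one spare input and one or two spare outputs, which is the flexibility referred to above.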

3) Routing the Programming Clock Signal: During the physical
design process, it was apparent that our synthesizable core was
placing an extra burden on the router due to the large number
of flip-flops in the design. A programmable logic core contains
many configuration bits to store the state of individual routing
switches and the contents of lookup tables; in a synthesizable
core, these configuration bits are built using flip-flops that have
clock inputs to enable programming. As shown in Fig. 10(a),
there are configuration bits for input muxes and output muxes,
as well as the LUTs themselves. Each of these FFs must be
connected to a common clock signal for programming purposes,
as indicated by the bold line.

To determine how flip-flop-intensive our core is, we com- 
pared its flip-flop density to that of a nonprogrammable design. 
We analyzed an ASIC implementation of a 68HC11 core, and
found that the flip-flop density (number of flip-flops per unit
area) was 1/3 of the flip-flop density in our programmable logic 
core. Thus, we realized that the clock tree in our core will be 
more complex and consume more chip area than a typical ASIC. 
This was confirmed; in our implementation, 45% of the layout 
area was consumed by the clock tree, power striping, and signal 
routing (experience with other ASICs of this size has shown that 
25% is usually enough). Furthermore, FFs must be connected as 
one long shift register for programming purposes, and this also 
added to the routing complexity. 

The results of the physical layout of the bit configuration 
clock routing are shown in Fig. 10(b). Our core contains 1803 
such flip-flops, each connected to the bit configuration clock 
signal. The clock net highlighted in white is the configuration
clock; this routing is clearly more complex than the other nets
(shown in grey). This extra clock complexity increases the area
overhead of the design beyond what would be estimated by
considering only the standard cell area. In our case, this is a notable
source of area overhead, since the original next state logic
was purely combinational logic with no FFs or clocks. Note that 
this clock tree overhead would occur in both a soft and hard pro- 
grammable logic core. 
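The shift-register style of programming described in this subsection can be modeled behaviorally. The sketch below is an assumption-laden illustration, not the fabricated circuit: it treats the 1803 configuration flip-flops as a single chain clocked by the programming clock, and the function names are hypothetical.

```python
CONFIG_BITS = 1803  # number of configuration flip-flops quoted in the text


def shift_in(chain: list[int], bit: int) -> list[int]:
    """One programming-clock edge: every flip-flop takes its neighbour's value."""
    return [bit] + chain[:-1]


def program(chain_length: int, bitstream: list[int]) -> list[int]:
    """Shift a full bitstream into the chain, one bit per clock."""
    if len(bitstream) != chain_length:
        raise ValueError("bitstream must match the chain length")
    chain = [0] * chain_length
    for bit in bitstream:
        chain = shift_in(chain, bit)
    return chain


# The first bit shifted in ends up at the far end of the chain, so the
# stored pattern is the bitstream in reverse order.
print(program(4, [1, 0, 1, 1]))
```

Because every flip-flop sits on both the programming clock and the serial data path, the router must reach all of them with two chip-wide nets, which is consistent with the routing burden reported above.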


D. Implementation Results 


1) Area Overhead: We implemented both the programmable
and nonprogrammable versions of the module
using the same tool flow to further quantify the area overhead.
The reference module (without the programmable logic core)
required 369 700 µm² in a 0.18-µm TSMC process, of which
1217 µm² is the area due to the assembly/disassembly controller
next state logic. The programmable module (containing
64 LUTs as described above) required 1 025 000 µm², of which
684 600 µm² was due to the programmable next state logic.

TABLE III
AREA RESULTS SUMMARY

Implementation Method | Area of Next State Logic | Area of Entire Chip
Non-Programmable (measured) | 1217 µm² | 369 700 µm²
Hard Prog. Core (estimated using results from [9]) | 107 000 µm² |

TABLE IV
SPEED RESULTS

 | Critical Path of Module (using first user-defined logic function) | Critical Path of Integrated Circuit (using second user-defined logic function)
Reference Module (no programmability) | 25.40 ns | 25.40 ns
Programmable Module (with synthesizable core) | | 51.08 ns

The layout areas are summarized in Table III. Clearly, the
differences in these numbers are significant. Our synthesizable
programmable logic core required 560× more chip area than
the fixed logic that it replaced. From the analysis in Section V,
the synthesizable core requires 6.4× more area than a hard
programmable logic core. However, the use of a hard core may not
be suitable for such fine-grain applications. It would require the
same considerations as any other hard IP plus additional ones
for programmability. For the size of fabric being used, the soft
PLC would provide a more seamless approach.
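The overhead figures quoted above can be cross-checked from the reported layout areas. The short calculation below uses only numbers stated in the text; the variable names are chosen for this sketch.

```python
FIXED_NEXT_STATE_AREA_UM2 = 1217          # fixed next state logic
SOFT_CORE_AREA_UM2 = 684_600              # programmable next state logic
REFERENCE_MODULE_AREA_UM2 = 369_700       # whole reference module
PROGRAMMABLE_MODULE_AREA_UM2 = 1_025_000  # whole programmable module
SOFT_VS_HARD_FACTOR = 6.4                 # soft versus hard core, from Section V

soft_vs_fixed = SOFT_CORE_AREA_UM2 / FIXED_NEXT_STATE_AREA_UM2
module_growth = PROGRAMMABLE_MODULE_AREA_UM2 / REFERENCE_MODULE_AREA_UM2
hard_core_estimate_um2 = SOFT_CORE_AREA_UM2 / SOFT_VS_HARD_FACTOR

print(f"soft core vs fixed logic: {soft_vs_fixed:.0f}x")
print(f"module area growth:       {module_growth:.2f}x")
print(f"hard core estimate:       {hard_core_estimate_um2:.0f} um^2")
```

The ratio of the programmable to the fixed next state logic comes out slightly above 560, and dividing the soft core area by 6.4 gives roughly 107 000 µm², in line with the hard-core estimate in Table III.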

Further investigation into the area overhead showed that 53% 
of the area of our programmable logic core was due to routing 
multiplexers and the configuration bits that control these mul- 
tiplexers, as shown in Fig. 10(a). These multiplexers are large; 
the largest in our core has 26 inputs. Our standard cell library 
contains only two- and four-input multiplexer cells; larger 
multiplexers are built by cascading these smaller multiplexers. 
Clearly, the area overhead could be improved significantly by 
either supplementing our cell library with larger multiplexers, 
or modifying the architecture to employ smaller multiplexers. 
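To give a feel for the cost of cascading, the sketch below counts the library cells needed to build a wide multiplexer from two- and four-input cells. The greedy tree construction is an assumption made for illustration; the synthesis tool may choose a different decomposition, and the function name is hypothetical.

```python
def mux_tree_cells(num_inputs: int) -> tuple[int, int]:
    """Return (four-input cells, two-input cells) for a greedy mux tree.

    Each level groups signals four at a time; a leftover group of three
    uses a four-input cell, a group of two uses a two-input cell, and a
    single leftover signal passes through to the next level.
    """
    if num_inputs < 1:
        raise ValueError("a multiplexer needs at least one input")
    mux4 = mux2 = 0
    signals = num_inputs
    while signals > 1:
        full_groups, leftover = divmod(signals, 4)
        mux4 += full_groups
        next_signals = full_groups
        if leftover == 1:
            next_signals += 1
        elif leftover == 2:
            mux2 += 1
            next_signals += 1
        elif leftover == 3:
            mux4 += 1
            next_signals += 1
        signals = next_signals
    return mux4, mux2


print(mux_tree_cells(26))
```

Under this assumption the 26-input multiplexer mentioned above needs eight four-input cells and two two-input cells across three levels, before counting the configuration flip-flops that drive its select lines.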

2) Delay Overhead: We measured the speed of our refer- 
ence and programmable modules before and after physical de- 
sign. Table IV shows the post-physical design results. In this 
case, we configured the core using the two user-defined logic 
functions mentioned above, and measured the length of the critical
path through the logic circuit in each case. As the table
shows, the programmable core has approximately twice the
critical path delay of the reference design for both
user-defined functions.

The module containing the programmable fabric was fabricated
in 0.18-µm TSMC CMOS and tested using the same two
user-defined logic functions. The speed results correlated well 
with the results shown above. The chip design had a critical path 
of about 40 ns compared to the expected 50 ns, well within the 
error tolerances of the models used in the CAD tools and the 
statistical variations of the CMOS process. 
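The delay figures can be checked with a short calculation. It uses the post-layout values from Table IV (25.40 ns and 51.08 ns) and the approximate measured and expected chip delays quoted above; the variable names are chosen for this sketch.

```python
REFERENCE_PATH_NS = 25.40     # reference module, post-physical design
PROGRAMMABLE_PATH_NS = 51.08  # programmable module, post-physical design
EXPECTED_CHIP_NS = 50.0       # approximate expected value quoted in the text
MEASURED_CHIP_NS = 40.0       # approximate measured value quoted in the text

slowdown = PROGRAMMABLE_PATH_NS / REFERENCE_PATH_NS
deviation = (EXPECTED_CHIP_NS - MEASURED_CHIP_NS) / EXPECTED_CHIP_NS
max_clock_mhz = 1e3 / PROGRAMMABLE_PATH_NS  # 1 / ns expressed in MHz

print(f"slowdown: {slowdown:.2f}x")
print(f"measured silicon is {deviation:.0%} faster than predicted")
print(f"post-layout clock limit: {max_clock_mhz:.1f} MHz")
```

The ratio is about 2.0, matching the "approximately twice" statement, and the fabricated part is about 20% faster than the tools predicted.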


VII. CONCLUSION 


In this paper, we have presented two new architectures
for synthesizable programmable logic cores. Synthesizable
programmable logic cores differ from the programmable
cores currently available from vendors in that they are obtained
as an HDL description and synthesized using standard synthesis
tools. The use of these cores has significant area overhead; we
have estimated an overhead of 6.4× compared to using “hard”
programmable logic cores. Yet, for small logic circuits, these
“soft” cores have a number of advantages: they are easy to
integrate with fixed logic, we can create cores of any size and
shape, and they are easy to migrate to a new technology.

One of the primary applications we envisage for these cores
is the implementation of small combinational logic blocks, such
as the next-state logic or output logic of state machines. As a
result, our architectures differ from traditional FPGAs
in that they only support combinational circuits, and are
“directional” in that signals only flow in one direction through the
fabric. In addition, the interconnect pattern is less flexible and
the routing resources less plentiful. We have performed experiments
to show that small combinational circuits can be implemented
on these cores efficiently.


This paper has also illustrated some of the issues that arise when
such a core is used, through the use of a proof-of-concept chip:
the choice of the size of a core, the choice of inputs and outputs,
and the difficulty in routing the flip-flops.

Better synthesis results could be obtained by adding specialized
cells to the standard-cell library to implement our programmable
logic fabric. We have not considered this in this
paper, since our goal was to create architectures that can be 
implemented using the standard synthesis tools, cell libraries, 
and design flows that are already familiar to integrated circuit 
designers. However, initial experiments have shown that, by 
removing unnecessary features, we can create a replacement 
for our flip-flop standard cell that is 40% the size of the stan- 
dard cell version. Since, in the entire fabric, the flip-flops ac- 
count for 43% of the chip area, we would expect significant 
savings if this standard cell was used to construct our fabric. We 
also expect that significant improvements can be obtained using 
custom-designed multiplexer standard cells. Clearly, if this de- 
sign technique is to become mainstream, specialized standard 
cells should be created. 
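The expected saving from a smaller flip-flop cell follows directly from the two percentages above. The estimate below assumes that only the flip-flop area shrinks and that routing, clock tree, and all other cells are unchanged, which is a simplification made for this sketch.

```python
FLIP_FLOP_AREA_FRACTION = 0.43  # share of the fabric taken by flip-flops
CUSTOM_CELL_RELATIVE_SIZE = 0.40  # custom cell size relative to the library cell

new_fabric_area = (1 - FLIP_FLOP_AREA_FRACTION) + (
    FLIP_FLOP_AREA_FRACTION * CUSTOM_CELL_RELATIVE_SIZE
)
saving = 1 - new_fabric_area

print(f"fabric area after substitution: {new_fabric_area:.1%} of original")
print(f"estimated saving:               {saving:.1%}")
```

Under these assumptions the fabric shrinks by roughly a quarter, before any gain from custom multiplexer cells.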

Although these soft cores are less efficient than their fixed 
counterparts, the use of programmable logic cores, and espe- 
cially synthesizable programmable logic cores, is still impor- 
tant. The post-fabrication flexibility that these cores provide will 
be vital as integrated circuits get larger and as masks get more 
expensive. Synthesizable programmable logic cores are a sen- 
sible solution when only small amounts of programmable logic 
are required, since they can be treated much like regular logic 
during the design process. The results of this paper clearly show 
that there is still work to be done improving their area and speed, 
but as new architectures are uncovered, and new CAD tech- 
niques are developed, it is likely that both hard and soft cores 
will become an important part of future integrated circuits.

Low Standby Power State Storage
for Sub-130-nm Technologies


Lawrence T. Clark, Senior Member, IEEE, Franco Ricci, and Manish Biyani 


Abstract—Handheld and other battery-powered ICs require 
process scaling to increase functional integration and reduce 
active power consumption. Scaling also increases leakage current 
components to the point where standby power is frequently a 
limiting design factor. A scheme combining low-leakage thick-gate 
shadow latches and high-performance transistors is presented that 
decouples performance from standby power in sub-130-nm technologies.
Circuit design and operation, including pulse-clocked
latches, use of dynamic circuits, and inclusion of scan, are presented.
The approach is validated by experimental results on a 90-nm 
process. 


Index Terms—Leakage currents, logic circuits, low power, se- 
quential logic circuits. 


I. INTRODUCTION 


INTEGRATED circuits designed for handheld and cell
phone applications must meet stringent energy requirements
due to limited battery capacity. Long device idle times
make standby power a limiting factor in battery lifetime.
Simultaneously, lower operating voltages reduce active power by
the well-known quadratic factor. Additionally, lower operating
voltages are required by process scaling, which in turn drives
lower threshold voltage (V_t) to maintain gate overdrive as the
power supply, V_CC, is scaled. Unfortunately, this increases
transistor subthreshold currents exponentially, leading to tradeoffs
between active and standby power in process selection unless
standby leakage is reduced by circuit design. Various schemes,
primarily focusing on application of reverse-body bias (RBB)
[1]-[4], or MTCMOS approaches [5]-[7], have been suggested
and used in products to address the primary leakage components.
These are the transistor off-state drain to source leakage
(I_off), as well as drain to bulk components, due to gate induced
drain leakage (GIDL) or direct tunneling from drain to bulk
in transistors with steep doping profiles, especially those with
pocket or halo implants [3], [8]. Since it is costly in terms of
both time and power to save and then restore the state of an IC,
it is imperative that any implementation be state retentive [19].
At the 130-nm technology node and beyond, oxide scaling
produces significant gate oxide leakage (I_gate) contribution due
to direct band-to-band tunneling [9], since oxide thickness must keep pace with
transistor channel length to maintain adequate control [9], [10].
Consequently, leakage reduction schemes for this and future
technology nodes need to address this increasingly important
component. The alternative is to use a thicker oxide and sacrifice
performance by attempting to make up for loss of transistor
gate control by very high doping.

Low standby power is frequently achieved by limiting transistor
scaling to avoid leakage increases. However, the power
supply voltage (V_CC) reduction that scaling allows is the best
method to limit active power, as illustrated in Fig. 1, which shows
0.18-µm microprocessor performance on two otherwise identical
processes having V_t differing by 110 mV, equivalent to
about 25× I_off leakage reduction. The 390-mV V_t data is calibrated
to an existing design that includes a low-standby-power mode
combining RBB and power supply collapse [3], [11], while the
500-mV V_t data is simulated. For each data point on the curves,
the processor is run at the maximum frequency allowed for the
given voltage, while in the case of low V_t combined with a
low-standby-power mode, excess cycles are spent in the
low-standby-power state. Voltage was scaled upwards in 100-mV increments
from 0.6 V as required by performance.
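The equivalence between a 110-mV threshold difference and a roughly 25× leakage ratio can be related to the subthreshold swing. The sketch below assumes the simple exponential model in which off-state current changes by one decade per swing S of threshold shift; it ignores drain-induced barrier lowering and other second-order effects, and the function names are hypothetical.

```python
import math


def implied_swing_mv_per_decade(delta_vt_mv: float, leakage_ratio: float) -> float:
    """Subthreshold swing S that makes delta_vt_mv correspond to leakage_ratio."""
    return delta_vt_mv / math.log10(leakage_ratio)


def leakage_ratio(delta_vt_mv: float, swing_mv_per_decade: float) -> float:
    """Off-state current ratio for a threshold shift under the simple model."""
    return 10 ** (delta_vt_mv / swing_mv_per_decade)


swing = implied_swing_mv_per_decade(110.0, 25.0)
print(f"implied swing: {swing:.1f} mV/decade")
print(f"check: {leakage_ratio(110.0, swing):.1f}x")
```

The implied swing is just under 80 mV/decade, a plausible room-temperature value for a bulk CMOS process, so the two numbers in the text are mutually consistent under this model.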

The lower curve in the figure shows that introduction of an
RBB low-power state, time multiplexed with active operation,
can simulate a lower leakage process while retaining the higher
performance and lower power at high frequencies. The zero-frequency
points show that identical standby currents can be obtained,
while at 400 MHz, power with V_t of 390 mV is 42%
lower than with V_t of 500 mV. Three factors led us to investigate
alternative schemes: the potentially decreasing efficacy
of RBB modes in future high-performance processes [12];
experience in practical application, where maintaining state in
domino circuits and imbalanced latches limits voltage collapse
[3]; and the desire to make the low-power mode operable at sub-1-V
V_CC and hence more compatible with dynamic voltage scaling
(DVS). Regardless of
the actual power savings approach employed, as long as such
schemes are state retentive and incur a small power penalty
upon entrance and exit, the analysis embodied in Fig. 1 applies.
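The time-multiplexing argument can be sketched as a duty-cycle average. This is a simplified model, assuming the processor runs at its maximum frequency for the chosen voltage and spends all excess cycles in the low-standby-power state, with entry and exit energy neglected. The power values in the example are hypothetical placeholders, not data from Fig. 1.

```python
def average_power(required_hz: float, max_hz: float,
                  active_w: float, standby_w: float) -> float:
    """Average power when active bursts are interleaved with standby.

    required_hz is the throughput the workload needs, max_hz the clock
    rate while active; the ratio is the fraction of time spent active.
    """
    if not 0 <= required_hz <= max_hz:
        raise ValueError("required frequency must lie between 0 and max_hz")
    duty = required_hz / max_hz
    return duty * active_w + (1 - duty) * standby_w


# Hypothetical numbers for illustration only.
print(average_power(100e6, 400e6, active_w=0.2, standby_w=0.0001))
```

At zero required frequency the average collapses to the standby power, which is why the zero-frequency points of the two curves can coincide.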

In this paper, circuits to implement the low-power state,
which address the increasing leakage components that face
sub-130-nm technologies, are presented. This is accomplished
by placing the IC state in latches fabricated using thick-gate,
high-V_t transistors and cutting off the supply to the nonstate
logic circuitry. This decouples the performance of the IC in
active operation from the standby power, affording more aggressive
scaling to even very power sensitive handheld devices
such as cell phones and personal digital assistants.

While the experimental circuits were fabricated in a 90-nm
technology, the circuits and methods are applicable to future
processes beyond the 65-nm technology node. Section II addresses
the basic circuit design and operation. Section III describes
the use in time borrowing latches and dynamic circuits,
Section IV the addition of scan capability, and Section V comprises
the experimental results and discussion. We conclude in
Section VI. This paper focuses entirely on logic rather than
memory usage, i.e., register file, latch, and flip-flop applications,
while neglecting SRAM, although the use of thicker gate for
SRAM has been explored [13].

Fig. 1. Power utilizing RBB power-down modes interspersed with active
operation versus no power-down mode and higher V_t.


II. CIRCUIT CONFIGURATION AND PROCESS 


The basic latch element is shown in Fig. 2. It comprises
a thin-gate, high-performance latch composed of the
CMOS pass gate, feedforward inverter IT and feedback tri-state
inverter ITF. The shadow thick-gate latch is composed of transistors
having both high V_t, affording low I_off, and thicker oxide
for low I_gate. The thick-gate region is outlined in the figure
for clarity, and the box gate symbols will be used to differentiate
thick-gate from thin-gate transistors throughout this paper. This
expands on the concept of high-V_t balloons described in [6]
and the idea of maintaining supply power only to the state elements
in an IC, while cutting off leakage to the combinational
logic via MTCMOS schemes [7], [14]. The thick-gate portion
is powered by a separate supply, V_CCTG. The thick-gate transistors
have higher V_t than the nominal thin-gate transistors,
essentially severing the connection between low-voltage performance,
maximum performance, and the standby power of the
design. Early simulations showed that using the thick-gate transistors
for the storage elements limited the register file write
speed to less than 300 MHz if written in the phase before a
read, while target designs included performance up to 2 GHz.
Similarly, late data input to a transparent latch could result in an
unacceptable timing push-out.

Our designs commonly use pulse-clocked latches to simulate
master-slave flip-flops at lower power and size. Slower thick-gate
transistors would require wider clock pulses and increase
effort aimed at meeting hold requirements in timing convergence.
This is described in detail in Section II-A. Consequently,
the redundant latch scheme as shown in the figure was chosen,
whereby the thick-gate write time, invoked only during entrance
into a low-power state, does not limit operational speed. It is expected
that the low-power state will be entered and exited at less
than kHz rates, making the thick-gate write speed unimportant.
For instance, in cell phone applications, the standby time can be
on the order of seconds, between phone communications with
the cell base stations.

Fig. 2. Latch incorporating thick-gate state retention element. Thick-gate,
high-V_t transistors are evident by the box gate symbols.

In the processes used to validate the circuits described here, 
the thick-gate transistors support IO and analog circuitry, which 
traditionally use a higher V_t [15]. While a transistor optimized
for this application would be preferable for electrical perfor- 
mance, it would increase process complexity and adversely 
affect die cost. Since storing the state in the separate latches 
requires no high voltages, the thick-gate transistors can be 
drawn at reduced channel length compared to their normal 
high-voltage design rules, to improve layout density. In prac- 
tice, layout density is limited by the thick to thin-gate-oxide 
spacing. 

During active operation, V_CCTG is shorted to V_CC on die to
limit IR drop induced noise between the supplies. Upon power-down,
the state is first written to the thick-gate domain, then
the entire combinatorial logic portion has the power supply removed
as in MTCMOS. Rather than gate the V_SS supply node as
done in our earlier RBB designs, the V_CC is removed externally
at the regulator, mitigating the IR drop and die size associated
with on-die power supply clamps.


A. Active Operation


As mentioned, to limit active power and delay through se- 
quential elements, a pulsed-clock latch simulates a master-slave 
flip-flop (MSFF) as shown in the waveforms in Fig. 3. This has
been shown to afford greater than 40% clock and sequential el- 
ement energy savings as well as allowing some time borrowing 
to alleviate clock skew [11], [16]. The resulting sequential ele- 
ments are smaller than a MSFF, helping to limit the overall se- 
quential circuit size. These advantages are substantial enough to 
merit increased effort in designing to the greater hold times re- 
quired. The signal LOW2ACT is de-asserted low in active mode 
operation, decoupling the thick-gate portion from the thin-gate 
high-performance latch. The minimal added capacitance due to 
the drains of thick-gate transistors M2–M5, which can be minimum
sized, has a small effect on circuit speed and power in the
active mode. 

Fig. 3(a) shows the write timing of the pulse-clocked latch. 
The signal LOW2ACT is asserted low and so is not shown. 
The storage nodes S1 and S1# are quickly written, allowing 
a short clock pulse on signal PCLK. The timing used is for 
a 1.5 GHz design with the clock period shortened to provide 
margin for worst-case clock skew. Timing analysis is performed 
across process corners and voltages to determine the appropriate 
clock pulse (PCLK) width for each target process. Fig. 3(b) il- 
lustrates the slower response of using the thick gate alone. For 
these purposes, transistors M1-M3 and the feedback thin-gate 
tri-state inverter ITF in Fig. 2 are removed. Otherwise the cir- 
cuit is unchanged. This makes the thick-gate latch, connecting 
nodes ST1 and ST1# the only state storage. Here, ACT2LOW 
is left enabled high so that nodes ST1 and ST1# can provide 
state storage in active operation. Note that thick-gate nodes ST1
and ST1# respond much more slowly, and with the same pulse
width the storage nodes fail to write even at a V_CC of 1.2 V. The
design can only effectively pull down on the thick-gate storage
node. This creates a slow transition, particularly when rising,
since it is pulled up via the small thick-gate PMOS. The higher
V_t of the thick-gate transistors will cause even further degradation
in write timing at lower voltages. This makes use of
thick-gate-only latches with DVS problematic.


B. Entering Standby Mode 


To enter the low-standby-power mode, ACT2LOW is as- 
serted high and the higher performance transistors in the 
thin-gate latch differentially write the thick-gate latch via the 
thick-gate pass transistors M4 and M5 as shown in Fig. 4. This 
operation relies upon the thin-gate devices having larger drive 
than the thick-gate devices. This is guaranteed by the lesser 
current drive of the thick-gate transistors due to their higher V_t,
as well as by sizing. Of course, this must be simulated across 
process corners and the required voltage range at worst-case 
opposing data conditions, where both charge sharing and op- 
posing currents may cause back writing. The thick-gate latches 
are all a single small size limited by the thick-gate design 
rules. Only the high-performance thin-gate transistors drive 
subsequent circuit stages. 

Entrance into the low-standby-power state is completed by
subsequent de-assertion of ACT2LOW to isolate the thick-gate
latches. At this point, the V_CC can be floated or driven low by the
external regulator. Floating the supply is preferred, since if the
mode is exited soon after entrance, less power supply charge is
needed to restore the operating supply voltage. The stored state
is isolated via thick-gate transistors, limiting the standby power
to the leakage of the thick-gate storage elements. All N-wells are
connected to V_CCTG, to avoid the increased size that well gaps
would incur. The N-well leakage component is inconsequential.
This also avoids discharging and charging the well capacitance
when entering and leaving the low-standby-power mode.

Fig. 3. Pulse-clocked latch timing (a) with LOW2ACT asserted low and
(b) with LOW2ACT asserted high. All waveforms are 1.2 V amplitude except
ST1 and ST1# that are marked.

The scheme disables all logic activity in the V_CC power domain,
and since the supply is floated, eventually leaking to 0 V,
clocks are low while in this state. Thus, the entire clock tree
can be on the main power supply and clock tree leakage is also
eliminated. For a design predominantly using rising edge triggered
flip-flops or pulse-clocked latches, it is then best to stop
the clock in the low phase. Having resolved the majority of the
cases with clock low, uncommon but very important cases are
left and are discussed in Section III.

Fig. 4. Simulated operational waveforms, showing entrance into and exit from
the low-power standby state.


C. Exiting Standby Mode 


To exit the low-standby-power mode, the signal LOW2ACT
is asserted high, turning on M1 and providing a ground connection
to transistors M2 and M3 that differentially sink current
to set the state of the thin-gate latch upon power-up. As the
supply increases from 0 V, the thick-gate transistors, having full
gate overdrive of V_CCTG − V_t, overpower the thin-gate transistors
while they are in subthreshold operation. This forces the
thin-gate storage to the correct state as it powers up, as in the
ferroelectric shadow state storage for SRAMs described in [17],
[18]. In the event that the supply does not completely collapse,
the thin-gate latch state is not lost until the cell is sufficiently
weak to allow writing via transistors M2 and M3. This case,
where the V_CC does not fully collapse, is shown in Fig. 4, where
the thick-gate transistors M2 and M3 must overpower the latch
drivers I1 and IT1. Specifically, in Fig. 4 the thin-gate latch
is purposely reversed after writing the state to the thick-gate
shadow latch. The V_CC supply is then only collapsed to 300 mV.
Nonetheless, the “one way” circuits correctly write the thin-gate
latch state.

While it would have been possible to use the pass transistors
M4 and M5 to write the thin-gate state during power-up [13],
we found that this was less robust in the event of incomplete
V_CC supply collapse combined with operation at process corners.
Consequently, the “one-way” design shown was adopted
despite the added size. To suppress I_gate due to transistors M2
and M3 while in standby, they must also be thick gate. Still more
thick-gate transistors could have been added to make the write
into the thick gate one-way as well, but due to the large drive
difference to be expected between the thin and thick-gate transistors,
easily ensured by proper sizing, this is unnecessary.
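The standby entry and exit sequence described in Sections II-B and II-C can be summarized with a purely behavioral model. The sketch below abstracts away all electrical detail (drive strength, charge sharing, partial supply collapse); the class and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShadowLatch:
    """Behavioral model of a thin-gate latch with a thick-gate shadow."""
    thin: Optional[int] = 0      # thin-gate state; None once V_CC is removed
    thick: Optional[int] = None  # thick-gate state, kept alive by V_CCTG
    vcc_on: bool = True

    def write(self, value: int) -> None:
        """Active-mode write through the pulse-clocked thin-gate latch."""
        if not self.vcc_on:
            raise RuntimeError("cannot write while V_CC is removed")
        self.thin = value

    def enter_standby(self) -> None:
        """ACT2LOW pulsed high, then V_CC floated or driven low."""
        if not self.vcc_on:
            raise RuntimeError("already in standby")
        self.thick = self.thin  # thin-gate latch writes the shadow via M4/M5
        self.vcc_on = False     # ACT2LOW de-asserted, supply removed
        self.thin = None        # thin-gate state is lost

    def exit_standby(self) -> None:
        """LOW2ACT asserted high while V_CC ramps back up."""
        self.vcc_on = True
        self.thin = self.thick  # M2/M3 set the thin-gate latch one way


latch = ShadowLatch()
latch.write(1)
latch.enter_standby()
latch.exit_standby()
print(latch.thin)
```

The model makes the one-way nature of each transfer explicit: the shadow is written only on entry, and the thin-gate latch is restored only on exit.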


Fig. 5. Register file cell with thick-gate state retention devices.


D. Register Files 


The register file design is shown in Fig. 5. The differential 
write assures good write performance at low voltages. The static 
NOR gate allows the pull down transistor to be half the width 
of a similar strength conventional stack. Thus, it lessens the dy- 
namic domino read bitline (RBL#) load as well as limiting the 
leakage produced on this high fan-in dynamic node. It also in- 
creases the noise immunity to the read wordline (RWL# in the 
figure) by interjecting a static gate before the domino input tran- 
sistor. The signal RWL# has less capacitive loading and overall 
read speed is retained. 

As in the pulse-clocked latch, the thick-gate latch is not used
as the primary storage node due to its slow speed. Specifically,
when the register file is written late in the second phase of
the clock and must be read in the next phase, a timing push-out
would occur if the nodes are incompletely written. The
register file cell operation is illustrated in the simulation results
comprising Fig. 6. The figure also includes three different
write bitline (WBL) timings, separated by 50 ps. The thin-gate
register file storage latch successfully writes even with
very late data setup time, analogous to the pulse-clocked latch
case already described. Note that the latest WBL timing fails,
as shown by the failed write to node S1. By using only the
thick-gate storage in the register file, the ability to time-borrow, i.e.,
to accept late arrival of the write data in the write phase, would have
been sacrificed. Since the pertinent circuits are the same, operation
when entering and leaving the standby mode is identical to
the latch previously mentioned. Use of thick-gate-only storage
would have also limited operating speed to a clock phase length
determined by the thick-gate latch write timing as mentioned.

Fig. 6. Register file cell write with late data.


III. APPLICATION TO OTHER CIRCUITS


The register file late write case is also applicable to the use of 
transparent high, rather than pulse-clocked latches. Since these 
allow time borrowing of nearly a clock phase, they are valuable 
for high-performance design. Shadow latches are attached just 
as in Fig. 2. 

The use of thick-gate shadow latches is also applicable to 
master-slave flip-flops. Since the global clock is held low during 
standby as mentioned, the shadow latch is needed only on the 
slave latch. The master latch, in a transparent condition (due 
to clock low) during power-up will be set to the state required 
by preceding latches during state recovery. This limits the over- 
head significantly. It should also be noted that the slave has a 
half cycle to set the retaining element, since the write of that 
latch always occurs beginning at the clock rising edge. Alter- 
natively, a flip-flop that is negative edge triggered requires that 
the shadow latch be attached to the master rather than the slave, 
so that the state properly propagates through the transparent on 
clock low latches as set by the shadow latch state. 


A. Dynamic Logic 


High-performance microprocessors frequently include a sub- 
stantial amount of logic implemented by precharge-discharge 
dynamic (domino) logic. Even in lower performance designs, 
memories and register files are usually implemented in this 
style. Domino circuit paths must end in a dynamic to static 
conversion stage, typically a latch, which holds the output state 
through the domino circuit precharge phase. Therefore it is im- 
portant to comprehend these circuits in any low-standby-power 
scheme. 

A prototypical domino circuit is shown in Fig. 7. Here D2
(footless) domino stages with outputs A and B are combined in
a NAND function by a set-dominant latch (SDL) that, besides
the NAND, functions as the dynamic to static conversion latch
with minimal delay. A typical application would be that nodes
A and B represent register file bitlines, with the read sense and
latching function provided by the SDL. The NAND gate is outlined
in the figure, where the additional transistors provide the
latch and output driver function. Specifically, transistors MP1
and MN1 provide the latch function, with the latter creating a
path to ground through transistors MN2 and MN3. Setting the
latch node LO to a one by asserting either nodes A or B low independent
of the clock creates the set dominance.

Fig. 7. Domino circuit and SDL dynamic to static conversion latch.

As typically done to isolate the storage node L1 from the 
output, separate feedback and output inverters are used. This 
also allows different P to N ratios for the feedback and output 
inverters, separately optimizing read speed and noise immunity. 
In general, since the critical edge is LO rising, the output inverter 
P to N ratio should be skewed to speed the falling edge at node 
Q. Noting that domino signals X and Y are only asserted high 
during clock high, nodes A and B can be asserted low during the 
same clock phase. Feedback transistor MN1 provides a path to 
ground while the clock CLK is low. 

The timing is shown in Fig. 8. At the clock rising edge, 
the storage node LO is discharged, since nodes A and B are 
precharged high in the previous (clock low) phase. When either 
node A or B is discharged low (only A is discharged in the 
figure), the latch immediately follows via the single PMOS 
pull-up transistor MP3, which is one of the two PMOS
pull-ups MP2 and MP3 of the NAND gate. In Fig. 8, signal X 
rises in the clock high phase, discharging node A, which prop- 
agates to the output and is latched as shown. In keeping with 
the register file example, this corresponds to node BITOUT 
in Fig. 5, while node A corresponds to node RBL#, the read 
bitline in Fig. 5. 
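
The set-dominant behavior described above can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions: signals are ideal logic levels, the function and variable names are hypothetical, and no transistor-level timing is modeled.

```python
def sdl_next(latch, a, b, clk):
    """Return the next value of the SDL latch node.

    a, b: domino nodes (1 = precharged high, 0 = discharged low).
    clk:  1 during the evaluate (clock high) phase, 0 during precharge.
    """
    if a == 0 or b == 0:
        return 1      # set dominance: a PMOS pull-up wins regardless of clk
    if clk == 1:
        return 0      # clock high with A and B high: latch node discharges
    return latch      # clock low: the feedback devices hold the state


def run(trace, latch=0):
    """Apply (clk, a, b) samples in order; return (latch, q) after each one."""
    history = []
    for clk, a, b in trace:
        latch = sdl_next(latch, a, b, clk)
        history.append((latch, 1 - latch))  # q is the inverted latch node
    return history


# Clock rises, node A discharges, clock falls (A precharges), clock rises again.
fig8_like_trace = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
result = run(fig8_like_trace)
```

In this sketch the latch node holds its evaluated value through the precharge phase and is cleared only at the next rising clock edge, which is the dynamic to static conversion the text describes.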


B. Dynamic Logic Standby Operation 


Since the clock is held low in standby, all domino circuits
that evaluate while the clock is high (phase 1 domino) are in
the pre-charge state when entering and leaving the low-power
mode, and the set-dominant latch (SDL) dynamic to static converter
latch holds the previously evaluated state. By adding a
thick-gate shadow latch to the SDL (see Fig. 9), the proper
state is restored to the circuit before returning to active operation.
Clock low (phase 2) domino circuits are evaluating upon
entrance to the low-power state. Hence, the half-latches composed
of their PMOS keepers may represent the proper state.
This presents the problem of where to keep this state while in
the low-power mode, as well as how to avoid falsely discharging
domino nodes that could, in turn, disrupt downstream state nodes
when the supply is restored.

Fig. 8. Domino logic and SDL dynamic to static conversion operation.

Fig. 9. Thick-gate NAND set-dominant latch dynamic to static converter with
integrated thick-gate scan slave and state retention latch.

Fig. 10. Phase 2 (clock low evaluate) domino clock control (a) and simulation
showing storage and reproduction of dynamic circuit state when entering and
exiting the low-power state (b).


C. Return From Standby for Dynamic Circuits


Precharging all evaluating domino while exiting the 
low-power state and subsequently allowing them to re-evaluate 
after return to the active state solves this problem. It also 
eliminates the possibility of erroneous domino operation at 
very low VCC, where the sum of NMOS transistor off currents
may become comparable with the keeper on current. This 
condition will cause the local domino node half-latch to be 
upset. The precharge and re-evaluate is accomplished by using 
the return to active signal, LOW2ACT, to enable the local 
clock buffer used for clock low domino as shown in Fig. 10(a). 
The low-phase domino clock, CLK#dominoCLK is forced 
low while LOW2ACT is high, forcing the domino node into 
precharge as power is restored, as evident in the figure. When 
LOW2ACT falls, this clock rises causing the domino gates to 
evaluate before active operation begins. This clock assertion is 
simply forced by the LOW2ACT signal input to the NOR gate. 
Thus, the domino inputs are set by the shadow latches, and the 
domino nodes are returned to their proper state by the single 
evaluate clock edge, independent of the state that they powered 
up in or collapsed to under a low supply voltage condition. 
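
The precharge and re-evaluate sequence can be illustrated with a minimal logic-level sketch. It assumes that the NOR gate combines the global clock with LOW2ACT (the text states only that LOW2ACT is an input to the NOR gate), and the function names are hypothetical.

```python
def nor(x, y):
    """Two-input NOR on 0/1 values."""
    return int(not (x or y))


def domino_clk(clk, low2act):
    """Local clock for clock-low-evaluate domino (assumed NOR of CLK and LOW2ACT)."""
    return nor(clk, low2act)


def domino_node(node, dclk, pull_down_on):
    """Precharge the node while dclk is 0; evaluate while dclk is 1."""
    if dclk == 0:
        return 1                          # precharged high
    return 0 if pull_down_on else node    # discharge only if the NMOS stack conducts


def return_from_standby(power_up_state, pull_down_on):
    """Global clock is held low; LOW2ACT is asserted and then released."""
    node = power_up_state
    node = domino_node(node, domino_clk(0, 1), pull_down_on)  # forced precharge
    node = domino_node(node, domino_clk(0, 0), pull_down_on)  # single evaluate edge
    return node
```

Whatever value the node powers up in, the forced precharge followed by one evaluate edge leaves it determined only by the inputs restored from the shadow latches.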

Fig. 10(b) shows a circuit simulation of this operation. The
signal ACT2LOW_SCLKB is asserted to write the thick-gate
shadow latch as before. Another clock cycle alters the state
of the domino node and the supply is subsequently collapsed,
also as before. LOW2ACT is asserted while the thin-gate power
supply VCC is returned high. This precharges the domino gate
by forcing the active high clock signal CLK#dominoCLK low.
The rising edge of CLK#dominoCLK re-evaluates the domino
gate with inputs driven by the preceding shadow state. The SDL
latch node LO is shown to follow the evaluate, including the
glitch due to precharge propagation. The clock then resumes
with the clock low evaluate domino gates in the correct state.

Fig. 11. Scan mode circuit operation.


IV. SCAN DESIGN 


By requiring an extra latch the scheme increases the overall 
circuit area as mentioned. However, using the shadow latch 
as the scan slave as illustrated in Fig. 9 can mitigate the area 
increase. To limit the increase in loading on the high-perfor- 
mance latch, it is written differentially in scan. Separate scan 
clocks allow nonoverlapping clock operation in scan, using 
the SCANCLKA and SCANCLKB (ACT2LOW) signals. This 
allows looser routing of the scan clock signals, which can be 
treated by routers as signals rather than clocks, as well as elim- 
inating race-through conditions on the scan chain. No separate 
scan enable signals are required. Operation is shown in Fig. 11. 
Referring to Fig. 9, signals DinA and DinB are high (held 
in precharge) and CLK is held low. The data is then scanned 
into the thin-gate latch by asserting SCANCLKA, and into 
the thick-gate slave by asserting ACT2LOW_SCANCLKB, 
respectively, as shown. Race-through risk during scan is also
lessened by the relatively low performance of the thick-gate
slave latches, although this limits scan operation to 300 MHz for the
reasons described previously (note the slow latch transitions).
This limitation should not have significant effect on test time or 
usability of the scan feature. Since there are few extra signals
and supplies, and given that auto-placed and routed logic blocks
are generally wire limited, the impact on block size is minimal
in that case. For register files, which do not require scan capability,
some of the size impact can be limited by placing the
thick-gate devices under metal-limited portions of the cell; the
impact is inversely proportional to the number of ports.

Fig. 12. Shmoo plots showing operational voltages. In (a) the voltages are
imbalanced while entering and exiting the low-power mode and in (b) only upon
exit.
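
The two-phase scan shifting described in this section can be sketched as follows. This is a behavioral illustration only: it assumes each thin-gate latch takes its scan data from the thick-gate slave of the preceding stage, and the function names are hypothetical.

```python
def pulse_a(thin, thick, scan_in):
    """SCANCLKA: every thin-gate latch captures the previous stage's slave."""
    new_thin = [scan_in] + thick[:-1]
    return new_thin, thick


def pulse_b(thin, thick):
    """SCANCLKB: every thick-gate slave captures its own thin-gate latch."""
    return thin, list(thin)


def scan_shift(bits, length):
    """Shift bits into a chain of the given length; return the thin-gate states."""
    thin = [0] * length
    thick = [0] * length
    for bit in bits:
        thin, thick = pulse_a(thin, thick, bit)   # phase A
        thin, thick = pulse_b(thin, thick)        # phase B, never overlapping A
    return thin


chain_state = scan_shift([1, 0, 1, 1], 4)
```

Because the two pulses never overlap, a bit advances exactly one stage per A/B pair, so race-through cannot occur in this idealized model regardless of how the two clocks are routed.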


V. EXPERIMENTAL RESULTS 


A test die containing a 32-entry translation lookaside buffer
built from the register file cells (as well as CAM cells) and four
scan chains of 2000 pulse-clocked latches each, containing a
total of 9920 thick-gate state retention latches, was fabricated in
a 90-nm process. A die plot is shown in Fig. 13, where the test
structure is 600 × 1700 µm and the active circuits are 0.51 mm²
in area. Fig. 12(a) is a Shmoo plot showing the passing and
failing voltages on the thin and thick-gate domains. At the intended
operating point, i.e., VCC = VCCTG, successful operation
is shown down to 0.8 V. As the voltage on the thick-gate
domain is raised above the thin-gate domain, charge sharing can
cause the write to fail, upsetting the thin-gate domain rather than
writing the thick-gate domain. At high VCC and low VCCTG,
the thin-gate transistors and large capacitance of the thin-gate
latch overpower transistors M1-M3, so the correct state cannot
be written back. In actual usage VCC will be strictly equal to
or lower than VCCTG due to leakage. In the former case, the
state is retained and in the latter case, correct operation has been
confirmed as in the simulation results described in Fig. 4 and
shown in Fig. 12(b). The test die is composed of minimum-sized
latches, while a real design will use a mix of large and small
latches. Larger latches are less susceptible to back writing, so
the measured results constitute a worst case.

The thin-gate threshold voltages were measured on e-test
structures to be Vtn = 413 mV and Vtp = -456 mV, while
the thick-gate threshold voltages were Vtn = 900 mV and
Vtp = -760 mV. Both sets of values are higher than the
targets. The very large thick-gate values, and their imbalance in
particular, account for the relatively high measured minimum
operating voltage (Vccmin), by causing the write to be subthreshold
through the thick-gate NMOS series transistors.
The thick-gate threshold voltages can be lowered substantially
without affecting the standby power, allowing improved
Vccmin. Additionally, lower thin-gate threshold voltages will
improve active power while not affecting that in standby.

Measured total power supply current on VCCTG in the
low-power state was between 2 and 6 µA, corresponding to
202 to 606 pA per cell, respectively, depending on the die and
voltage, at room temperature. This is attributable to aggressive
halo doping and consequent band-to-band tunneling at the
drain edges. Depending upon process architecture, this can be
lowered substantially. It may also be addressed via design by
lowering the VCCTG supply voltage while in the low-power
state. This will be a topic of future work.
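
As a quick consistency check on the per-cell figures, the total current can be divided by the 9920 retention latches on the die, taking the total to be 2 to 6 µA (the reading that is consistent with the quoted picoampere values):

\[
\frac{2\ \mu\text{A}}{9920} \approx 202\ \text{pA}, \qquad
\frac{6\ \mu\text{A}}{9920} \approx 605\ \text{pA}
\]

The upper value lands within rounding of the quoted 606 pA, which is three times the rounded lower value.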


VI. DISCUSSION AND CONCLUSIONS 


Standby leakage presents a considerable obstacle to transistor
scaling for future battery-operated devices. We have presented
a latch design that allows low standby power for sub-130-nm
processes, which have gate leakages that in and of themselves
exceed typical 100 µA standby limits for an IC. The number of
transistors in the design is limited, helped in large part by the use
of pulse-clocked latches rather than master-slave flip-flops. This
choice improves performance, energy, and size. For instance,
the master-slave design of [6] requires 32 transistors while this
design requires only 21. The previous design is also prone to
charge sharing during power-up, while in the design presented
here, a one-way write to the thin-gate high-performance domain
alleviates any possibility of back-writing in the event of
incomplete supply collapse. In our design, the latch speed is
optimized for high performance by bypassing the thick-gate
high-Vt transistors during active operation, while low standby
power is achieved by storing the state in low-leakage transistors.
The approach has been shown to be applicable to a wide range
of static and dynamic circuits. Finally, the added size due to
the larger thick-gate transistors and increased spacing between
thin and thick-gate devices is effectively mitigated by using the
thick-gate storage element as the scan slave. Non-overlapping
clocks in the scan mode of operation alleviate race-through.

Fig. 13. Die plot of the test chip. The four shift register arrays as well as the TLB are evident left to right.
A High-Performance Very Low-Voltage Current
Sense Amplifier for Nonvolatile Memories


Antonino Conte, Gianbattista Lo Giudice, Gaetano Palumbo, Senior Member, IEEE, and Alfredo Signorello 


Abstract—A high-performance sense amplifier for nonvolatile 
memories capable of working under a very low-voltage power 
supply is presented. The topology of the sense amplifier uses a 
pure current-mode comparison allowing power supplies lower 
than 1 V to be used and includes two subcircuits which improve 
slew rate performance. 

The sense amplifier was implemented in an EEPROM realized
with a 0.18-µm EEPROM technology. Experimental results
showed a read access time of about 30 ns with a power supply of
1.65 V.


Index Terms—Current mode, EEPROM, low voltage, non- 
volatile memory, sense amplifier, smart card. 


I. INTRODUCTION 
VARIOUS electronic systems used in telecommunications
(pagers, mobile telephones, etc.), in consumer products
(smart cards, palmtops, digital video cameras and cameras), 
and in personal computers (BIOS) require nonvolatile memo- 
ries with high speeds in both read and write operation modes 
as well as low power consumption [1]-[3]. The need for very 
low power consumption, which increases battery life time and 
portability, has become a key design aspect particularly for 
portable electronic equipment. To satisfy the low power con- 
straints in the digital circuit domain, the customary way is to 
reduce the power supply voltage [4]-[12]. Hence, a 1.5-V-only 
(or even lower) nonvolatile memory is required in keeping with 
present voltage reduction trends [13]-[17]. 

An important example of portable microelectronic systems
is the Smart Card, which has come into daily use in the last few
years. Smart Cards, usually of the same dimension as credit
cards and made of plastic materials, incorporate a microsystem 
containing several electronic subsystems that allow elaboration 
and memorization operations [18]. Contactless Smart Cards 
that derive their power supply from radio signals have become 
a trend [19], [20]. In this type of application, low-voltage 
nonvolatile memories, and in particular EEPROM, are needed. 
Moreover, given that the time interval when the Card is supplied 
is quite limited, the memories adopted must have extremely 
high read and write ratings. These requirements are difficult to 
satisfy when the objective is also to lower the supply voltage 
[13]. 
Fig. 1. Block scheme of a conventional sense amplifier.


Read speed is mainly determined by the read path, which is 
affected in a nonnegligible way by the sense amplifier’s speed 
performance, and becomes critical when the power supply is 
reduced [21]. 

This paper focuses on a novel topology sense amplifier for 
nonvolatile memories, capable of operating at voltages as low 
as 1 V, and satisfying speed constraints. This sense amplifier 
operates under very low voltage without needing special low 
threshold voltage devices. The pre-charging speed performance 
of the bitline is still preserved, despite avoiding recourse to cas- 
coding techniques for the pre-charge scheme to overcome low 
power supply limitations. These two features make the proposed 
scheme particularly appealing in standard memory processes 
and very low-voltage range of applications. 


II. SENSE AMPLIFIER FOR NONVOLATILE MEMORIES


The reading operation of an EEPROM or Flash is performed
by sensing the cell current under well-defined biasing conditions.
In particular, a programmed EEPROM cell has a low
threshold voltage, giving a high current level under the bias condition.
In contrast, an erased EEPROM cell has a high threshold
voltage, giving a low current level. The convention for a Flash
memory cell is reversed. The read operation can clearly be achieved
by comparing the cell current with a reference current, generally
provided by another cell and normally linked with the process
characteristics. Although the read operation can naturally be performed
in a current-mode approach, traditionally a voltage-mode
operation is adopted. In fact, the read operation is implemented by
using a voltage sense amplifier, which compares the voltage
after the current is converted to voltage (Fig. 1).

In general, differential sense topologies, which have greater
advantages than the corresponding single-ended version, are
used. Their classic topology is based on the conventional block
scheme in Fig. 1, where $I_C$ and $I_{REF}$ model the cell current
and the reference current, respectively, and $V_{OUT}$ is the sense
amplifier output voltage [2], [21]. The current mirror M3-M4,
with a mirror aspect ratio lower than one (and typically set
to 0.5 to ensure an equal delay for a 1 or 0 read), is used to
appropriately scale the current of the reference cell. Since a
fundamental role is played by the pre-charging method in any
sensing scheme for nonvolatile memories, traditionally this
task has been accomplished by adopting cascoding techniques
using an inverter with a source follower output stage and a
unitary feedback loop. This approach allows fast pre-charging
independently of the capacitive load represented by the bitline
of the array (MAT side in the sensing scheme). Although
this solution is very useful down to a power supply of 1.8 V,
it shows nonnegligible limitations once the power supply is
lowered further. This occurs because the source follower does
not correctly bias the bitline at the desired level (imposed
by technology constraints) and also affects the reading speed
performance.

Fig. 5. Circuit block to improve the pre-charge phase of the sense amplifier.

Although the block scheme in Fig. 1 is not suitable for low- 
voltage operation, introducing some of the modifications pro- 
posed in literature can allow its use with low-voltage nonvolatile 
memories, albeit at the cost of reducing performance [14]-[16]. 

In particular, the solution proposed in [14] based on the 
so-called self-biasing bitline sensing scheme, exploits the 
charge sharing effect between the dummy bitline (one for every 
sense amplifier) and the addressed bitline. In addition, it uses 
an n-channel transistor in cascoding configuration to separate 
the capacitive net of the bitline from the net used to perform 
the comparison. There are two drawbacks to this solution. One
is control of the final bitline pre-charge level, which depends
on the power supply (usually half the power supply), which is
itself not under control. The other is the need for an n-channel
transistor to implement the cascoding of the bitline, thereby
restricting very low-voltage operation.

Fig. 6. Detailed scheme of the low-voltage sense amplifier.

TABLE II. SIMULATION RESULTS

The solution presented in [15] requires special low threshold 
voltage transistors which are not strictly mandatory, even if they 
can be profitably used in other memory subcircuits (such as charge
pumps). Another drawback is represented by the control of the 
bitline voltage biasing, which is not suitable for power supply 
voltages much lower than 1.5 V. 


III. VERY LOW-VOLTAGE SENSE AMPLIFIER


The key idea behind the proposed topology is based on implementing
a true current comparison operation [22], [23]. Current
comparison is performed simply by a current mirror loaded with
a current generator. According to the block scheme in Fig. 2,
where OUT1 is the sense amplifier output voltage, $I_C$ and $I_{REF}$ are
the cell current and the reference current, respectively, and $I_B$
is a bias current, the output voltage tends to the power supply
when $I_C$ is greater than $I_{REF}$; otherwise it tends to ground.
In particular, for small differences between $I_C$ and $I_{REF}$, the
output voltage swing around the bias condition is given by

\[ \Delta V_{out} \approx r_{out} \, (I_C - I_{REF}) \tag{1} \]

where $r_{out}$ is the small-signal resistance at the output node.
Of course, the mirroring behavior disappears when current differences
produce huge voltage swings. The output voltage becomes
equal to the power supply or ground, as transistor M2 is
forced to work in cut off or in the linear region, respectively.
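
The small-signal relation (1), together with the saturation of the output at the rails, can be sketched numerically. The output resistance and bias point below are hypothetical values chosen for illustration and are not taken from the paper; only the 10.5-µA reference and the 6-µA and 13-µA cell currents come from the simulation setup described later.

```python
def comparator_output(i_c, i_ref, v_bias, r_out, v_dd):
    """Estimate OUT1 from Eq. (1), clipped to the supply rails.

    i_c, i_ref: cell and reference currents in amperes.
    v_bias:     output voltage at the bias condition (assumed value).
    r_out:      small-signal output resistance in ohms (assumed value).
    """
    v = v_bias + r_out * (i_c - i_ref)
    return min(max(v, 0.0), v_dd)


V_DD = 1.65      # nominal supply used in the paper's simulations
V_BIAS = 0.8     # hypothetical bias point
R_OUT = 1e6      # hypothetical 1-Mohm output resistance
I_REF = 10.5e-6  # reference current from the simulation setup

programmed = comparator_output(13e-6, I_REF, V_BIAS, R_OUT, V_DD)
erased = comparator_output(6e-6, I_REF, V_BIAS, R_OUT, V_DD)
near_threshold = comparator_output(10.6e-6, I_REF, V_BIAS, R_OUT, V_DD)
```

With these assumed values, differences of a few microamperes drive the output to a rail, while a 0.1-µA difference stays in the linear range where (1) applies.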


A. Sense Amplifier Core 


The drawbacks of the simple block scheme in Fig. 2 are due
to the bias voltage required on the bitline node (i.e., node BL
in Fig. 2) before the memory cell is connected (i.e., before the
current $I_C$ is applied). This must be accurately set to around 0.8 V,
because it coincides with the drain node of the EEPROM cells
and hence affects the cell current being sensed.¹ In particular,
with typical current bias values, threshold voltage and process
parameters, a minimum transistor size cannot be used. To overcome
this drawback, and define the bias voltage on the bitline
in a sufficiently insensitive manner, the block scheme in Fig. 3
was adopted. It is based on a p-type current mirror which sets on
the diode-connected NMOS transistor the same current which
flows in an equally sized NMOS transistor with the required
bias voltage on its gate. As shown in the Appendix, after setting
transistors M3 and M5 equal to M1 and M4, respectively, and
neglecting channel length modulation, as well as short channel
effects, the voltage on the bitline when the memory cell is not
connected (i.e., with $I_C = 0$) equals the reference voltage, $V_{REF}$.

The circuit in Fig. 3 maintains the low-voltage features of the
current mirror scheme in Fig. 2. Indeed, it can work with a power
supply as low as a threshold voltage plus a saturation drain-source
voltage, which with modern technologies means a value
lower than 1 V. Under this extremely low-voltage power supply
the drawback is the substantial difference between the drain-source
voltages of the two transistor couples M1, M3 and M4,
M5, which causes a nonnegligible error between the voltage
reference and the resulting bitline voltage. However, as can be
simply derived from the relationships included in the Appendix,
with power supply voltages around twice the minimum power
supply (i.e., $2V_T + 2V_{DS,sat}$, where $V_{DS,sat}$ is the drain-source
saturation voltage of a transistor) an ideal matching condition
between the drain-source voltages can be achieved by cancelling
the channel length modulation effects. Thus, the resulting bitline
voltage is ideally equal to the reference voltage.

¹Remember that the level of current of a nonvolatile cell under the different
conditions (erased and programmed) changes with the drain voltage, and the
technology is generally characterized for only a typical value.

Fig. 7(a). Simulation results of the sense amplifier under a 1.65-V power supply
assuming the cell erased with a cell current equal to 6 µA (upper plot), the cell
programmed with a cell current equal to 13 µA (lower plot).


B. Slew Rate Increase 


Although the scheme in Fig. 3 is simple and efficient, it exhibits
the typical drawback of a limited slew rate (afflicting any
class A amplifier), which limits the speed at which the bitline
can be pre-charged to the required voltage. Indeed, the bitline represents
a heavy capacitive load, $C_{BL}$, and the time slot required to
charge it to the bias voltage, $V_{REF}$, is equal to

\[ t_{pre} = \frac{C_{BL} V_{REF}}{I_3} \tag{2} \]

where $I_3$ is the saturation current of M3. It is evident that to
reduce the pre-charge time we need to increase the bias current.
However, this proportionally increases power consumption.
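
As an illustrative evaluation of (2), assume the 1-pF load used later in the simulations, a bitline bias of about 0.8 V, and, as an assumption not stated in the text, that $I_3$ equals the 16-µA bias current:

\[
t_{pre} = \frac{C_{BL} V_{REF}}{I_3} \approx \frac{1\ \text{pF} \times 0.8\ \text{V}}{16\ \mu\text{A}} = 50\ \text{ns}
\]

Under these assumed values the unassisted pre-charge alone would take as long as the 50-ns access-time bound quoted in the simulation results, which illustrates why the slew rate has to be increased.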

To overcome this shortcoming, two auxiliary circuits were
added that increase the current charging the bitline only
during the pre-charge time slot. The first, shown in Fig. 4,
initially raises the reference voltage $V_{REF}$ at the gate of transistor
M3; this voltage then progressively decreases until the required value is
reached. In particular, when the bitline is discharged, transistor
M9 is switched off and transistors M7 and M8, which have equal
width, become equivalent to a diode-connected transistor with
the same width and a length equal to the sum of the lengths of
M7 and M8. Note that under this condition M7 and M8 work in
saturation and the triode region, respectively. During the initial
fast pre-charge phase the current used to charge the bitline is
equal to the bias current multiplied by the mirror factor formed
by the series of M7 and M8 on one side and transistor M3 on
the other, given by

\[ K_1 = \frac{(W/L)_3}{(W/L)_{7,8}} \tag{3} \]

where $(W/L)_{7,8}$ denotes the aspect ratio of the equivalent transistor
formed by the series of M7 and M8.

When the bitline reaches an NMOS threshold voltage, transistor
M9 begins to sink current, reducing the voltage reference $V_{REF}$,
because the drain-source voltage drop of M8 (or M9) is decreased.
To obtain design relationships that relate the steady-state
voltage reference, $V_{REF}$, to the transistor aspect ratios, we
can also assume M9 is equal to M8 and M7. In the steady state,
transistor M9 has the same gate voltage as M8, and we can approximate
M8 and M9 with an equivalent transistor of the same
length and width equal to twice that of M8 (or M9).

Fig. 7(b). (Continued.) Simulation results of the sense amplifier under a 1-V
power supply assuming the cell erased with a cell current equal to 6 µA (upper
plot), the cell programmed with a cell current equal to 13 µA (lower plot).

To further increase speed during the pre-charge phase, a circuit
providing an additional current only during the pre-charge time
slot is added as well (see Fig. 5). When the bitline voltage is
lower than an NMOS threshold voltage, the circuit feeds an additional
current to the bitline node which is equal to the bias current
amplified by gain $K_2$ with the two current mirrors M10-M11
and M12-M13 in Fig. 4. After the bitline has reached the threshold
voltage, transistor M14 switches on and a current mirror between
M1 and M14 is created. Transistor M14 then sinks the current
$I_{bias}$, which means the circuit M10-M13 is switched off.

Fig. 8. Sense amplifier microphotograph (inside the bold circle).

Fig. 9. Experimental results from the sense amplifier read access time on an
erased cell.


In conclusion, the pre-charge time is reduced through the two
circuits in Figs. 4 and 5, and can be split into two main contributions,
approximated by

\[ t_{pre} \approx \frac{C_{BL} V_{TH}}{I_{bias}(K_1 + K_2)} + \frac{C_{BL}(V_{REF} - V_{TH})}{I_3} \tag{4} \]

where current $I_3$ is given by M3 in saturation with a gate voltage
equal to $V_{REF}$.
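
The effect of the two boosting circuits can be estimated with a short sketch. It reads (4) as a fast phase up to the NMOS threshold at current $I_{bias}(K_1 + K_2)$ followed by a slow phase up to $V_{REF}$ at current $I_3$. The gains $K_1 = K_2 = 2$, the 0.8-V reference, and the choice $I_3 = I_{bias}$ are hypothetical values for illustration; the 1-pF load, 16-µA bias current, and roughly 700-mV threshold follow the figures quoted in the results section.

```python
def precharge_time(c_bl, v_ref, v_th, i_bias, k1, k2, i3):
    """Two-phase bitline pre-charge estimate, following the reading of Eq. (4)."""
    fast_phase = c_bl * v_th / (i_bias * (k1 + k2))  # 0 V up to the NMOS threshold
    slow_phase = c_bl * (v_ref - v_th) / i3          # threshold up to V_REF
    return fast_phase + slow_phase


def unassisted_time(c_bl, v_ref, i3):
    """Single-phase estimate of Eq. (2), without the boosting circuits."""
    return c_bl * v_ref / i3


C_BL = 1e-12     # 1-pF read-path load
V_REF = 0.8      # assumed bitline bias in volts
V_TH = 0.7       # threshold of about 700 mV
I_BIAS = 16e-6   # bias current
K1 = K2 = 2      # hypothetical mirror gains
I3 = I_BIAS      # assumption: M3 carries the bias current in steady state

boosted = precharge_time(C_BL, V_REF, V_TH, I_BIAS, K1, K2, I3)
plain = unassisted_time(C_BL, V_REF, I3)
```

With these assumed numbers the boosted pre-charge takes roughly a third of the unassisted time, while the extra current is drawn only during the fast phase.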

The complete sense amplifier is shown in Fig. 6. To properly
amplify the output voltage, a two-stage amplifier is added. The
first stage compares the internal output, OUT1, with the voltage
on the bitline (i.e., the reference voltage). It is made up of the
low-voltage differential amplifier M15-M16 biased with two
current generators, $I_{bias}$, which include a folded mirror active
load to improve the gain by a factor of two without limiting the
minimum allowable power supply [24]. The second stage is the
simple inverter M19-M20.

Fig. 10. Experimental results from the sense amplifier read access time on a
programmed cell.


IV. SIMULATION AND EXPERIMENTAL RESULTS 


The very low-voltage sense amplifier presented in the previous
sections was integrated in an EEPROM memory fabricated in a
0.18-µm EEPROM technology using the transistor aspect ratios
summarized in Table I, a bias current, $I_{bias}$, equal to 16 µA, and
a $V_{TH}$ of about 700 mV.

Transient simulations were carried out with a reference current, $I_{REF}$, equal
to 10.5 µA, a capacitive load modeling all the read paths
(i.e., the load due to the memory matrix and the array of sense amplifiers)
equal to 1 pF,² and various power supplies and memory
cell currents. In particular, those at a nominal
power supply of 1.65 V under the two critical cases of a memory
cell weakly erased and weakly programmed, modeled with
a cell current equal to 6 µA and 13 µA, respectively, are
plotted in Fig. 7(a). The upper plot refers to the case
of an erased cell with an $I_C$ equal to 6 µA, the lower one to a
programmed cell with an $I_C$ equal to 13 µA, and the middle
plot shows the level of cell current in both the erased and the
programmed case and the reference current set to 10.5 µA. It is
worth noting that the latter case is the most critical since the cell
current is closer to the reference current than in the other one. The
output data obtained sampling the signal S,,¢ is also shown in 
Fig. 7(a), where it is named DOUTEE. 

To highlight the variation in the simulated performance of
the sense amplifier, simulation results using different power supplies
and temperatures are summarized in Table II. In particular,
a read access time lower than 50 ns for a current difference of
1 µA can always be obtained.

In order to show the correct behavior of the sense amplifier
with a 1-V power supply, transient simulations setting a reference
current, $I_{REF}$, equal to 9 µA and a capacitive load equal to 1 pF,
under the two critical cases of a memory cell weakly erased
and weakly programmed, are plotted in Fig. 7(b). In particular, in
Fig. 7(b) the upper plot refers to the case of an erased cell with
an $I_C$ equal to 6 µA, and the lower one to a programmed cell
with an $I_C$ equal to 13 µA. Moreover, the last row of Table II
reports the read access time and minimum current compared for a 1-V power
supply at 27°C. Of course, at a 1-V power supply the
access time is increased, but as shown in Fig. 7(b) the sense amplifier
behavior is correct.

²A typical value is used for the capacitive load. Indeed, for a 64 kB memory
we have 512 cells connected to each BL, resulting in an equivalent capacitive
load of about 450 fF. The metal interconnection which connects all the cell drains
has a parasitic capacitance of about 370 fF. Finally, the bus interconnection between
the sense amplifier and the column decoder contributes about 250 fF.
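
The 1-pF load used in the simulations can be cross-checked against the three contributions quoted in the footnote on the capacitive load:

\[
C_{load} \approx 450\ \text{fF} + 370\ \text{fF} + 250\ \text{fF} = 1070\ \text{fF} \approx 1\ \text{pF}
\]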

The sense amplifier has a silicon area of about 600 µm² and
its microphotograph is shown in Fig. 8.

Experimental results are plotted in Figs. 9 and 10. In particular,
Fig. 9 refers to an erased cell (i.e., a cell current lower than
the reference current set to 10 µA) and Fig. 10 to a programmed
cell. Measurements were carried out on a reading cycle of an
erased cell at 1.65 V. The measurements show a correct output
level after about 20 ns, allowing a read access time of about
30 ns. Moreover, as expected, measured average current consumption
is about 60 µA. More specifically, it is equal to 60 µA
if measured in a time window of 100 ns, while it is equal to
76 µA if measured during the read period of about 38 ns.


V. CONCLUSION 


A current sense amplifier solution for nonvolatile memories 
has been presented. The circuit exhibits good performance over 
a very low-voltage range, allowing extensive control of both 
speed and bitline voltage levels, even under the extreme condition 
of power supplies as low as 1.35 V. Moreover, the absence 
of any cascoding technique in the bitline pre-charging scheme 
allows the circuit to function with power supplies as low as 1 V, 
since only a power supply higher than the sum of a threshold voltage 
and a drain-source saturation voltage is needed. 

The sense amplifier was implemented and validated with a 
0.18-µm EEPROM technology for Smart Card applications and 
enables a read access time lower than 30 ns. 


APPENDIX 


Using the well-known Shichman-Hodges equation, which 
means neglecting short-channel effects, on the circuit in Fig. 3 
when $I_C = 0$, the ratios of the drain currents of transistors M1 
and M3 and of transistors M5 and M4, $I_1/I_3$ and $I_5/I_4$, 
respectively, are given by 


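$$\frac{I_1}{I_3} = \frac{(V_{GS1} - V_{Tn})(1 + \lambda_n V_{DS1})}{(V_{GS3} - V_{Tn})(1 + \lambda_n V_{DS3})} = \frac{(V_{GS1} - V_{Tn})(1 + \lambda_n V_{GS1})}{(V_{REF} - V_{Tn})[1 + \lambda_n (V_{DD} - V_{SG4})]} \tag{A1}$$

$$\frac{I_5}{I_4} = \frac{1 + \lambda_p V_{SD5}}{1 + \lambda_p V_{SG4}} = \frac{1 + \lambda_p (V_{DD} - V_{GS1})}{1 + \lambda_p V_{SG4}} \tag{A2}$$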











where $V_{Tn}$ is the threshold voltage of the NMOS transistor, $\lambda_n$ 
and $\lambda_p$ are the channel length modulation parameters of the NMOS 
and PMOS transistors, respectively, and the other parameters 
have the usual meaning. Since currents $I_1$ and $I_3$ are equal to $I_5$ 
and $I_4$, respectively, we get 


$$\frac{(V_{GS1} - V_{Tn})(1 + \lambda_n V_{GS1})}{(V_{REF} - V_{Tn})[1 + \lambda_n (V_{DD} - V_{SG4})]} = \frac{1 + \lambda_p (V_{DD} - V_{GS1})}{1 + \lambda_p V_{SG4}} \tag{A3}$$





Relationship (A3) states that $V_{GS1} = V_{REF}$ when the 
channel length modulation is neglected (i.e., $\lambda_n = \lambda_p = 0$) or when 
the source-drain voltages are matched for the NMOS and the PMOS 
transistor couples. The same results can be achieved by considering 
short-channel effects. 
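
The two conditions stated for (A3) can be checked numerically. The sketch below evaluates the difference between the two sides of (A3), read with first-power overdrive terms as they appear in this text; all voltages and modulation parameters are illustrative assumptions, not values from the paper, and the function name `a3_residual` is hypothetical.

```python
def a3_residual(v_gs1, v_ref, v_dd, v_sg4, v_tn, lam_n, lam_p):
    """Left side minus right side of (A3); zero when (A3) is satisfied."""
    lhs = ((v_gs1 - v_tn) * (1 + lam_n * v_gs1)) / (
        (v_ref - v_tn) * (1 + lam_n * (v_dd - v_sg4))
    )
    rhs = (1 + lam_p * (v_dd - v_gs1)) / (1 + lam_p * v_sg4)
    return lhs - rhs


# Illustrative values only (assumptions, not taken from the paper).
V_DD, V_TN, V_REF = 1.65, 0.5, 0.8

# Case 1: channel length modulation neglected, V_GS1 = V_REF.
r_no_clm = a3_residual(V_REF, V_REF, V_DD, v_sg4=0.7, v_tn=V_TN,
                       lam_n=0.0, lam_p=0.0)

# Case 2: nonzero lambdas, but matched voltages (V_SG4 = V_DD - V_GS1).
r_matched = a3_residual(V_REF, V_REF, V_DD, v_sg4=V_DD - V_REF, v_tn=V_TN,
                        lam_n=0.10, lam_p=0.15)

print(r_no_clm, r_matched)
```

In both cases the residual vanishes (up to floating-point rounding), which is the content of the statement that follows (A3).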
A Novel High-Speed Sense Amplifier 
for Bi-NOR Flash Memories 


Chiu-Chiao Chung, Hongchin Lin, Member, IEEE, and Yen-Tai Lin 


Abstract—A novel high-speed current-mode sense amplifier 
is proposed for Bi-NOR flash memory designs. Program and 
erasure of the Bi-NOR technologies employ bi-directional channel 
FN tunneling with localized shallow P-well structures to realize 
the high-reliability, high-speed, and low-power operation. The 
proposed sensing circuit with advanced cross-coupled structure by 
connecting the gates of clamping transistors to the cross-coupled 
nodes provides excellent immunity against mismatch compared 
with the other sense amplifiers. Furthermore, the sensing times 
for various current differences and bitline capacitances and re- 
sistances are all superior to the others. The agreement between 
simulation and measurement indicates the sensing speed reaches 
2 ns for the threshold voltage difference of lower than 1 V at 
1.8-V supply voltage even with the high threshold voltage of the 
peripheral CMOS transistors up to 0.8 V. 


Index Terms—Advanced cross-couple, Bi-NOR, clamping tran- 
sistor, flash memory, FN tunneling, mismatch, threshold voltage. 


I. INTRODUCTION 


FOR contemporary memories, array structures and periphery 
circuits, such as decoders, charge pumps, level 
shifters, and sense amplifiers, determine the overall system 
performance in terms of power dissipation and access speed. 
The high-speed low-power sense amplifier is one of the critical 
components. Due to low-voltage operation, current sensing 
techniques have received a lot of attention in the last decade. 
Many sense amplifiers based on cross-coupled transistor 
structures were designed to overcome the loading effects [1]-[3] 
for DRAM or SRAM, but few studies have discussed the 
mismatch of sense amplifiers. Another category of memories 
is flash memory [4], [5]. The trend is toward not only high density 
and low voltage, but also multilevel storage. Therefore, the threshold 
voltage deviation of the programmed memory cells has to be 
well controlled for low-voltage operation. The sense amplifiers 
require high sensitivity and excellent immunity to mismatch in 
threshold voltage and W/L (channel width/channel length) 
ratio of devices. 

For flash memories, comparison of current difference be- 
tween the flash cell and the reference cell is the direct and fast 
method to read the data. However, for the Bi-NOR [6], [7] flash 
memory arrays, most of the sensing circuits developed for the 
conventional flash memory cells [8], [9], such as the simple 
four-transistor sense amplifier [10], PMOS bias type sense 
amplifier [11], and differential latch type sense amplifier [12], 
are not appropriate. Since these sense amplifiers were designed 
for draining cell current at the drain node of the flash cell, 
their bitlines were usually pre-charged to high before sensing. 
However, the current direction for Bi-NOR cells is reversed. 
The sense amplifier drains the current of the flash cell at the 
source node, thus the bias at the bitline source node has to be 
low enough for the cell current to flow to the sense amplifier. 
Though the clamped bitline (CBL) sense amplifier [13] is 
appropriate for the Bi-NOR cells, it results in higher 
power consumption, lower sensing speed, and poorer mismatch 
behavior due to the equalization of the bitlines before sensing. 

To comply with these restrictions, we propose a new sense 
amplifier (NSA) that utilizes an advanced cross-coupled structure, 
connecting the gates of the clamping MOS transistors to 
the cross-coupled nodes, to improve the mismatch characteristics 
and reduce the power consumption without sacrificing 
sensing time. The mismatch is also improved if the equalization 
between the drains of the two clamping MOS transistors is 
removed, since the currents from the selected cell and the reference 
cell slightly charge the drains before sensing. 

The new circuit and its operation principle for Bi-NOR cells 
are described in Section II. Section III compares the sensing 
speed versus threshold voltage difference, bitline capacitance, 
and channel length mismatch with the clamped bitline sensing 
scheme. The theory of mismatch improvement is also given in 
that section. In Section IV, the measurement results show the 
agreement with simulations. Section V is the conclusion. 


II. THE NEW SENSE AMPLIFIER AND ITS OPERATION 


The flash memory cell used in this study is based on the 
Bi-NOR technology [6], [7], which uses bi-directional channel 
FN tunneling with a localized shallow P-well structure to realize 
high-reliability, high-speed, and low-power operation. 
The conduction channel width of the flash cell is no longer 
one-dimensional. Fig. 1(a) illustrates the cross-sectional view 
of Bi-NOR flash memory cells. The current consists of the 
conventional current path (solid arrow) and the side conduction 
path shown by the dashed arrow. Since the electron current 
flows from the width, length, and bottom (deep N-well) 
directions, the read conduction current increases by more than 15%, 
which enhances the read performance. The typical operating 
conditions for the Bi-NOR cell are listed in Table I. Fig. 1(b) shows 
the read path from an array to the sense amplifier. For a selected cell, 
since the drains of the flash cells in the same row are connected and 
biased at 1 V from the source switch, the current has to flow 
to the sense amplifier at the bitline of the flash cell. Therefore, 
the bias at the bitline must be close to zero to comply with 
the requirement. This new operation makes most of the sense 
amplifiers designed for conventional flash cell arrays 
inappropriate for the new cell array. 

Generally, the sensing circuit is composed of a current source 
transporting the cell’s contents through the bitline to the data 
line, and a latch stage converting the differential current in the 
data line to the output node. For the Bi-NOR cell array 
described above, the new current-mode sense amplifier shown 
in Fig. 2(a) employs the cross-coupled latch structure (M1-M4) 
with sensor activation (Men) and equalization of output nodes 
(M7). Transistors M5 and M6 clamp the bitline voltage close to 
ground, and the sensing nodes ($c_{in}$ and $r_{in}$) drain currents from 
the selected cell and the reference cell, respectively. $R_{bitline}$ 
and $C_{bitline}$ represent the parasitic resistance and capacitance at 
the bitline. The timing diagram of signals SE, En, Nodes a, b, 
and out for the new sense amplifier is illustrated in Fig. 2(b). 

The operation of the sense amplifier can be divided into three 
phases: pre-charge, signal amplification, and reset for the next 
operation. In the pre-charge phase, the appropriate signals are 
applied to force the sensing nodes to certain potentials. In the 
amplification phase, the comparison and amplification are exe- 
cuted between the sensing nodes, so the content of the selected 
memory cell is retrieved. After that, the sense amplifier is reset 
for the next operation. 

Fig. 2. (a) Circuit diagram of the new sense amplifier. (b) Timing diagram for 
the operation of the new sense amplifier. 


The sensing operation starts by turning on Switch Men and 
Switch M7. During the pre-charge phase, the output node voltages 
are equalized ($V_a = V_b$) so that the currents in M1 and M2 
are the same. For the case of $I_{cell} > I_{ref}$, the current through 
M5 will be larger than that of M6 ($I_{M5} > I_{M6}$). Therefore, 
the bias at Node $c_{in}$ is slightly higher than that at $r_{in}$. 
Meanwhile, since M3 and M4 are both in the saturation region and 
the gate-to-source voltage ($V_{gs}$) of M3 is less than that of M4 
($V_{gs3} < V_{gs4}$), the current through M3 is smaller than that of 
M4 ($I_{M3} < I_{M4}$). 

At the end of the pre-charge cycle, M7 is turned off, so 
transistors M1-M4 act as a high-gain positive feedback amplifier. 
Due to positive feedback, the impedance looking into the source 
node of either M3 or M4 is negative. That makes M3 and M4 
begin to source the currents when M7 is turned off. Since M4 
has a stronger ability than M3 to discharge the voltage at 
node b, the different currents flowing through the drains of 
transistors M3 and M4 amplify the voltage difference across 
the output nodes (a and b) of the sense amplifier. During the 
pre-charge phase, it is important that the clamping transistors 
M5 and M6 be sized slightly larger so that they are biased 
in the linear region, thus activating the regeneration 
procedure of the inverter pairs (M1/M3 and M2/M4) as a latch in 
the later amplification phase. 

Since the inputs of the sense amplifier are low-impedance 
current sensing nodes, the highly capacitive bitlines only need 
to be charged slightly for the sensing operation. As a result, 
the sensing speed is only minimally influenced by the bitline 
capacitance and the current difference. In addition, because 
the bitline potentials always remain low during the 
sensing operation, power consumption is significantly reduced. 
Another important feature is the improvement of the mismatch 
problem, which will be explained in the next section. 


III. PERFORMANCE EVALUATION 


The new sensing circuit was designed and fabricated using 
0.25-µm Bi-NOR flash memory technologies with 0.4-µm 
CMOS transistors with threshold voltage $|V_t|$ of 0.8 V for 
peripheral circuits at a supply voltage of 1.8 V. Fig. 3(a) shows 
the simulated waveforms of the signal SE, the nodes a, b, and 
out of the proposed sense amplifier in the case of $I_{cell} > I_{ref}$ 
with an output load capacitance of 20 fF. The simulation results 
show the sensing speed is about 2.3 ns for the current difference 
($\Delta I = I_{cell} - I_{ref}$) of 6.5 µA. Fig. 3(b) gives the waveforms 
of the current input nodes with a bitline resistance of 320 Ω and 
capacitance of 2 pF for the flash cell ($c_{in}$) and the reference cell 
($r_{in}$). As mentioned before, the potentials at the sources of the 
cells are quite low. They are pre-charged to 0.5 V at $c_{in}$ and 
0.1 V at $r_{in}$ for the pre-charging time of 30 ns. 

In order to evaluate the proposed design, it is compared with the 
clamped bitline (CBL) sense amplifier [13] illustrated in Fig. 4(a). 
Its small-signal equivalent model for the typical cross-coupled 
circuit is given in Fig. 4(b), in which $C_d$ is the equivalent 
capacitance at the output nodes of the sense amplifier including 
the Miller capacitance from the diffusion capacitances $C_{gd}$, and 
$R_d$ includes the parallel combination of the output resistances 
of both n-channel and p-channel transistors. The clamp 
transistors M5 and M6 are biased in the linear region with equivalent 
capacitances $C_s$ and conductance $g_{ds}$. Resistors $R_{12}$ and $R_{34}$ 
mimic the small impedances of switches during the equalization 
phase. For the current difference $\Delta I = I_{cell} - I_{ref} > 0$, 
the voltage difference between Nodes $c_{in}$ and $r_{in}$ is defined as 
$\Delta V = V_3 - V_4$. For the CBL sense amplifier, $\Delta V_{CBL}$ is 


$$\Delta V_{CBL} = \frac{1}{g_{ds}}\left[(I_{cell} + I_{M3} - \Delta I_{34}) - (I_{ref} + I_{M4} + \Delta I_{34})\right] \tag{1}$$


where $g_{ds}$ is the drain-source conductance of M5 and M6 and 
$\Delta I_{34}$ is the current through $R_{34}$. 

On the other hand, in the new sense amplifier the gates of 
M5 and M6 are connected to the cross-coupled nodes, and the 
currents through the flash cell, M3, and M4 are denoted as $I'_{cell}$, 
$I'_{M3}$, and $I'_{M4}$, respectively. Therefore, the voltage difference 
of the proposed circuit, $\Delta V_{NSA}$, between $c_{in}$ and $r_{in}$, without 
the $\Delta I_{34}$ term, becomes 


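$$\Delta V_{NSA} = \frac{1}{g_{ds}}\left[(I'_{cell} + I'_{M3}) - (I_{ref} + I'_{M4})\right]. \tag{2}$$

For the same sense amplification capability, $\Delta V_{CBL} = \Delta V_{NSA}$, 
(1) should be equal to (2): 

$$(I_{cell} + I_{M3} - \Delta I_{34}) - (I_{ref} + I_{M4} + \Delta I_{34}) = (I'_{cell} + I'_{M3}) - (I_{ref} + I'_{M4}). \tag{3}$$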

Fig. 4. (a) Circuit diagram of the clamped bitline sense amplifier. 
(b) Equivalent circuit with M5/M6 denoted as resistors of $1/g_{ds}$. 


If we assume $I_{M3} - I_{M4} = I'_{M3} - I'_{M4}$ before amplification, this 
means 


$$(I_{cell} - I_{ref}) - 2\Delta I_{34} = (I'_{cell} - I_{ref}). \tag{4}$$


Fig. 3. (a) Simulated waveforms of Signal SE, out, and Nodes a and b of the new sense amplifier. (b) Simulated waveforms of Nodes $c_{in}$ and $r_{in}$ of the new sense amplifier. 


This clearly shows that the CBL sense amplifier requires a larger 
current difference to compensate for the offset [14], since $\Delta I_{34} > 0$ 
when $\Delta I = I_{cell} - I_{ref} > 0$. The basic difference between the 
proposed and the CBL sense amplifiers lies in the fact that the 
equalization device of the proposed circuit is not placed in the 
current path during the pre-charge phase. Thus, the proposed 
circuit provides a faster response time and better mismatch 
immunity than the CBL sense amplifier. 
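
The consequence of (4) can be illustrated with a few lines of arithmetic. In the sketch below the reference current and the equalization-path current $\Delta I_{34}$ are hypothetical values chosen only for illustration (the text gives no value for $\Delta I_{34}$), and the function names are assumptions.

```python
def effective_delta_i_cbl(i_cell_ua, i_ref_ua, delta_i34_ua):
    """Effective current difference of the CBL amplifier per (4), in uA."""
    return (i_cell_ua - i_ref_ua) - 2.0 * delta_i34_ua


def required_delta_i_cbl(target_effective_ua, delta_i34_ua):
    """Cell/reference difference the CBL amplifier needs to reach a given
    effective difference, obtained by rearranging (4)."""
    return target_effective_ua + 2.0 * delta_i34_ua


# Hypothetical numbers: 10-uA reference, 6.5-uA difference, 1 uA through R34.
effective = effective_delta_i_cbl(16.5, 10.0, 1.0)
required = required_delta_i_cbl(6.5, 1.0)
print(f"effective difference: {effective} uA, required difference: {required} uA")
```

With these assumed values, the CBL circuit sees an effective difference smaller than the applied one by $2\Delta I_{34}$, so it needs a correspondingly larger cell current difference for the same sensing capability.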

The following comparisons were carried out with the same 
fan-in and fan-out conditions for both circuits with the transistor 
sizes listed in Table II. Fig. 5 compares the sensing speed and 
average power dissipation as functions of the current difference 
for a given bitline resistance of 320 Ω and capacitance of 2 pF 
at $V_{dd} = 1.8$ V and a switching frequency of 25 MHz. The 
simulations were performed for current differences between the flash 
cell ($I_{cell}$) and the reference cell ($I_{ref}$) of 3 to 10 µA. As 
expected, a larger current difference results in faster sensing. 
The proposed circuit provides much faster sensing and lower 
power consumption than the CBL sensing circuit. The reason 
is that the proposed sense amplifier does not consume sensing 
current of the cells either to compensate for the current path 
($\Delta I_{34}$) offset or to maintain low biases at $c_{in}$ and $r_{in}$, 
and thus incurs less power dissipation. 

The comparison of sensing speed versus bitline capacitance 
between the proposed and the CBL sense amplifiers for the typical, 
best, and worst transistor models with a current difference 
of 10 µA at $V_{dd} = 1.8$ V is illustrated in Fig. 6. According 
to the simulations, both sense amplifiers exhibit almost constant 
sensing delay independent of the bitline load capacitance, since 
both amplifiers separate the outputs and the bitlines. However, 
the new circuit has a variation of 14% between the typical and 
the best/worst cases, while the CBL has a variation of 22%. The 
sensing time as a function of pre-charging time for variations in 
the capacitance and resistance of the bitlines in the memory cell 

TABLE II 
TRANSISTOR W/L SIZES FOR THE NEW AND THE CBL SENSE AMPLIFIERS 

Transistor    NSA                 CBL 
M1, M2        2 µm / 0.55 µm      2 µm / 0.55 µm 
M3, M4        8 µm / 0.55 µm      8 µm / 0.55 µm 
M5, M6        25 µm / 0.65 µm     25 µm / 0.65 µm 
M7, Men       25 µm / 0.55 µm     25 µm / 0.55 µm 
M8            NA                  25 µm / 0.55 µm 

Fig. 5. Simulated sensing speed and average power dissipation for various 
current differences ($\Delta I$). 

Fig. 6. Sensing speed versus bitline capacitance for different process corners 
for bitline resistance of 320 Ω. 


array is plotted in Fig. 7. In general, a shorter pre-charging 
time leads to a longer sensing time. It can be observed that the 
pre-charging time is longer with larger capacitance. However, 
the variation is not large. Note that the sensing time is barely 
affected by the resistance variation. 

The mismatch in W/L ratio or threshold voltage plays a critical 
role in symmetric cross-coupled sense amplifiers, since 
it may result in erroneous sensing output. A simplified model 
shown in Fig. 8 explains the effect of mismatch in the sensing 

Fig. 7. Sensing speed versus pre-charging time with respect to various bitline 
resistance and capacitance. 

Fig. 8. Equivalent circuit of the new sense amplifier with threshold voltage 
mismatches. 








operation. $\Delta V_{tp}$ and $\Delta V_{tn}$ represent the threshold voltage 
mismatch of the PMOS and NMOS transistors, respectively, while 
$g_{ds}$ denotes the identical drain-to-source channel conductance 
of M5 and M6. Assuming no mismatch between M5 and M6 in the 
following analysis, the offset voltage due to threshold voltage 
mismatch at the regenerative nodes (Nodes 1 and 2), for the worst 
polarity, may be expressed as 


$$V_{offset} \approx \Delta V_{tn} + \Delta V_{tp} = (g_{mn}\Delta V_{tn} + g_{mp}\Delta V_{tp}) \cdot R_{12} \tag{5}$$


where $g_{mn}$ and $g_{mp}$ are the transconductances of the NMOS and 
PMOS transistors, respectively, and the offset voltage due to threshold 
voltage mismatch is translated into a current mismatch at the drain with 
a gain of $g_m$ through resistance $R_{12}$. The current difference 
between the selected cell and the reference cell, $\Delta I = I_{cell} - I_{ref}$, 
results in a differential voltage $V_{diff}$ representing the data 
of the selected cell to be read. $V_{diff}$ can be written as 


$$V_{diff} = \Delta I \cdot R_{12}. \tag{6}$$


The ratio of the differential voltage across the differential nodes 
to the offset voltage, called the safety margin, is defined as [15] 


$$\text{Margin} = \frac{V_{diff}}{V_{offset}} = \frac{\Delta I}{g_{mn}\Delta V_{tn} + g_{mp}\Delta V_{tp}} = \frac{\Delta I}{I_{offset}} \tag{7}$$

where $I_{offset}$ is the effective offset current, which equals 
$g_{mn}\Delta V_{tn} + g_{mp}\Delta V_{tp}$. 
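
The safety margin in (7) is a simple ratio and can be sketched as follows. The transconductances and threshold mismatches below are hypothetical values chosen for illustration; only the 6.5-µA current difference appears in the text, and the function names are assumptions.

```python
def offset_current(g_mn, g_mp, dv_tn, dv_tp):
    """Effective offset current I_offset = g_mn*dV_tn + g_mp*dV_tp (amperes)."""
    return g_mn * dv_tn + g_mp * dv_tp


def safety_margin(delta_i, g_mn, g_mp, dv_tn, dv_tp):
    """Safety margin per (7): delta_I / I_offset (dimensionless)."""
    i_off = offset_current(g_mn, g_mp, dv_tn, dv_tp)
    if i_off == 0.0:
        return float("inf")  # no mismatch, so no offset to overcome
    return delta_i / i_off


# Hypothetical device values: 100 uS and 50 uS transconductances,
# 10 mV of threshold mismatch on each device type.
margin = safety_margin(delta_i=6.5e-6, g_mn=100e-6, g_mp=50e-6,
                       dv_tn=0.010, dv_tp=0.010)
print(f"safety margin = {margin:.2f}")
```

A margin above 1 means the cell current difference exceeds the effective offset current; under these assumed values the margin is a little above 4.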

The safety margin depends on the transconductance and 
threshold voltage mismatch of the cross-coupled devices. 
When switch M7 is on, the currents through M3 and M4 can 
be approximated as 


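$$I_{M3} = \frac{\mu_n C_{ox} W}{2L}(V_{gs3} - V_{tn})^2, \qquad I_{M4} = \frac{\mu_n C_{ox} W}{2L}(V_{gs4} - V_{tn})^2 \tag{8}$$

where $\mu_n$ is the electron mobility, $C_{ox}$ is the gate capacitance, 
and $V_{tn}$ is the threshold voltage of the NMOS transistors. 
In the case of threshold voltage mismatch shown in Fig. 8, 
the current through M3, denoted as $I_{M3(mismatch)}$, is varied by a 
mismatch $\Delta V_{tn}$: 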


$$I_{M3(mismatch)} = \frac{\mu_n C_{ox} W}{2L}(V_{gs3} - V_{tn} + \Delta V_{tn})^2. \tag{9}$$

For $I_{cell} > I_{ref}$, the source of M3 is charged by a voltage on 
sensing node $c_{in}$, denoted as $V_{cin}$; therefore (9) can be rewritten 
as 


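$$\begin{aligned}
I_{M3(mismatch)} &= \frac{\mu_n C_{ox} W}{2L}\left[V_{g3} - (V_{s3} + V_{cin}) - V_{tn} + \Delta V_{tn}\right]^2 \\
&= \frac{\mu_n C_{ox} W}{2L}\left[V_{g3} - V_{s3} - V_{tn} + (\Delta V_{tn} - V_{cin})\right]^2 \\
&= \frac{\mu_n C_{ox} W}{2L}\left[V_{gs3} - V_{tn} + (\Delta V_{tn} - V_{cin})\right]^2
\end{aligned} \tag{10}$$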


where $V_{g3}$ and $V_{s3}$ are the gate and source voltages of M3, 
respectively. The threshold voltage mismatch for the proposed 
circuit is reduced due to the term $(\Delta V_{tn} - V_{cin}) = \Delta V_{tn(NSA)}$ 
in (10). According to the safety margin definition in (7), 
$\Delta I / I_{offset}$, either a larger current difference $\Delta I$ or a 
smaller offset current benefits the sensing operation when 
mismatch arises. The proposed circuit charges the sensing 
node $c_{in}$ to reduce the offset current $I_{offset}$, with the term 
$(\Delta V_{tn} - V_{cin})$ replacing $\Delta V_{tn}$ in (10). The CBL sense 
amplifier, however, does not have this effect because of the 
equalization between $c_{in}$ and $r_{in}$. Therefore, with the same current 
difference for amplification, the proposed circuit is superior to the 
CBL sense amplifier in mismatch immunity. 
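
The effect of the $V_{cin}$ term can be seen by evaluating the square-law expression in (10) with and without it. This is a minimal sketch: the device constant `K`, the bias voltages, the mismatch, and the node voltage are all hypothetical values, not figures from the paper.

```python
def drain_current(k, v_gs, v_tn, dv_tn=0.0, v_cin=0.0):
    """Square-law current per (10): k * [V_gs - V_tn + (dV_tn - V_cin)]^2.

    k stands for mu_n * C_ox * W / (2 * L), in A/V^2.
    """
    return k * (v_gs - v_tn + (dv_tn - v_cin)) ** 2


K, V_GS3, V_TN = 100e-6, 1.2, 0.8   # hypothetical device values
DV_TN, V_CIN = 0.020, 0.015         # hypothetical mismatch and node voltage

ideal = drain_current(K, V_GS3, V_TN)
without_vcin = drain_current(K, V_GS3, V_TN, dv_tn=DV_TN)
with_vcin = drain_current(K, V_GS3, V_TN, dv_tn=DV_TN, v_cin=V_CIN)

dev_without = without_vcin - ideal  # current deviation when V_cin is absent
dev_with = with_vcin - ideal        # current deviation when V_cin is present
print(f"deviation without V_cin: {dev_without * 1e6:.3f} uA")
print(f"deviation with V_cin:    {dev_with * 1e6:.3f} uA")
```

With these assumed numbers the mismatch-induced current deviation shrinks when the $V_{cin}$ term is present, which is the mechanism the paragraph above attributes to the proposed circuit.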

Since the threshold voltage mismatch can be treated as equivalent to the 
geometry (W/L ratio) mismatch [15], the worst-case mismatch 
may be obtained by applying the possible worst cases at the same 
time. Therefore, the sensing circuits were simulated using the 
center dimensions given in Table II with channel length 
mismatches on M1, M4, and M6, which were selected as 
$L_{M1} = L_{M2} + \Delta L$, $L_{M4} = L_{M3} + \Delta L$, and $L_{M6} = L_{M5} + \Delta L$, 
respectively, where $\Delta L$ is the channel length mismatch. The 
sensing speed degrades slightly with channel length mismatch 
up to $\Delta L$ = 0.05 µm for the new sensing circuit, while the CBL 
sense amplifier cannot tolerate mismatches beyond 0.015 µm for 
a current difference $\Delta I$ = 10 µA at the pre-charging time 
of 50 ns, as shown in Fig. 9. In contrast, for the case of 
$I_{cell} < I_{ref}$, the mismatch is not critical, since the mismatch 
helps the sensing operation. 

Fig. 10. Chip microphotograph of the new sense amplifier. 


IV. EXPERIMENTAL RESULTS 


The chip microphotograph of the new circuit fabricated using 
0.25-µm Bi-NOR flash memory with 0.4-µm CMOS for peripheral 
circuits is presented in Fig. 10. The test chip was designed 
using the currents generated from the selected cell and reference 
cell. Each path has a 320-Ω resistor and two parallel 2-pF capacitors 
in between to mimic the parasitic effects in the memory 
arrays. The cell currents are obtained by applying 1 V to the 
drains of the selected cell and reference cell with different wordline 
voltages to the gates of the cells. Since the wordline voltage 
difference between the selected cell and the reference cell was 
assumed to be equivalent to the threshold voltage difference 
between them, the current difference was produced by varying the 
wordline voltage of the reference cell. Fig. 11 demonstrates that 
the on-chip measured delay time between the signal SE and the 
output node for the new sense amplifier is about 2.3 ns when 
the threshold voltage difference is 0.8 V. 

Fig. 12. Sensing speed versus various threshold voltage differences ($\Delta V_t$). 

The comparison of the sensing delay times between simula- 
tion and measurement for the given threshold voltage difference 
from 0.8 to 1.3 V is shown in Fig. 12. The CBL sense amplifier 
needs more current difference to compensate the offset, so it 
takes longer sensing time. The new sense amplifier with the cur- 
rents slightly charging the sensing nodes before sensing makes 
the response time shorter. The agreement between measurement 
and simulation is also observed. 


V. CONCLUSION 


A new low-power sensing circuit for 0.25-µm Bi-NOR flash 
memory technology was designed and measured. The proposed 
scheme achieves a sensing speed of about 2 ns and a power 
consumption of less than 6 µW at a switching 
frequency of 25 MHz and a supply voltage of 1.8 V. With the special 
connection of the gates to the cross-coupled output nodes, 
the immunity to device mismatch is improved significantly. That 
also makes the new current-mode sense amplifier much easier 
to design and fabricate. These analyses have also 
shown that the sensing delay of the new sense amplifier is almost 
independent of the bitline capacitance, which indicates that 
it is an excellent candidate for higher density memory. 
Constant-Charge-Injection Programming: 
A Novel High-Speed Programming Method 
for Multilevel Flash Memories 


Hideaki Kurata, Shunichi Saeki, Takashi Kobayashi, Yoshitaka Sasago, Tsuyoshi Arigane, Kazuo Otsuga, and 
Takayuki Kawahara, Senior Member, IEEE 


Abstract—Constant-charge-injection programming (CCIP) 
has been proposed as a way to achieve high-speed multilevel 
programming in flash memories. In order to achieve high 
programming throughput in multilevel flash memory, the programming 
method must provide: 1) high-speed cell-programming; 2) high 
programming efficiency; and 3) highly uniform programming 
characteristics. Conventional source-side channel-hot-electron 
injection (SSI) programming realizes both fast cell-programming 
and high programming efficiency, but the large cell-to-cell varia- 
tion in programming speed with SSI is a problem. CCIP reduces 
the characteristic variation of SSI programming and satisfies all of 
the above requirements. By applying CCIP to 2-bit/cell AG-AND 
flash memory, the high programming throughput of 10.3 MB/s is 
obtained with no area penalty. This is 1.8 times faster than the 
throughput with conventional SSI programming. 


Index Terms—AG-AND, CCIP, flash memory, high-speed pro- 
gramming, multilevel cell, SSI. 


I. INTRODUCTION 


THE increasing application of flash memory as the main 
storage medium of portable equipment such as digital still 
cameras and music players is creating requirements for greater 
storage capacities and faster programming. Storage capacities 
above 100 MB are required for the storage of high-resolution 
pictures in digital cameras, still or moving, and for CD-quality 
music recording in digital audio players. In addition, if we set 
a target of 10 s for downloading 100 MB of music data (data 
in MP3 audio format that plays for a time equivalent to that of a 
single CD), the required programming throughput is 10 MB/s. 

The multilevel cell (MLC) technique, in which two bits 
are stored in each physical memory cell [1], [2], is one of 
the most effective approaches for expanding storage capacity. 
When multilevel programming is used, however, two main 
factors slow down the programming throughput [3]-[6]. One 
is the large swing of $V_{th}$, which extends the cell-programming 
time. The other is the time taken by the careful adjustment needed 
to narrow the mid-level $V_{th}$ distributions by repeated programming 
and verification. 

The programming throughput ($PT$) of multilevel flash memories 
in general is expressed by 

$$PT = \frac{N_{bit}}{T_{cell} + N_{bit}/f_{clock} + T_{set} + T_{vfy} \times N_{vfy}} \tag{1}$$





where $T_{cell}$ is the cell-programming time, $N_{bit}$ is the number of 
cells being programmed simultaneously, and $f_{clock}$ is the clock 
frequency of the interface. $T_{set}$ is the time overhead which is not 
related to verification, including the time taken to set up the 
internal programming voltages. $T_{vfy}$ is the time overhead for each 
verification, and $N_{vfy}$ is the number of internal programming 
and verification cycles for one programming operation. While 
three of the parameters in (1), $f_{clock}$, $T_{set}$, and $T_{vfy}$, depend 
principally on the peripheral circuits, the other three parameters, 
$N_{bit}$, $T_{cell}$, and $N_{vfy}$, are strongly dependent on the 
cell-programming method. To achieve high programming throughputs 
for multilevel flash memories, a large $N_{bit}$, short $T_{cell}$, and small 
$N_{vfy}$ are indispensable. 
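
The dependence of throughput on these parameters can be sketched numerically. The code below assumes that (1) reads $PT = N_{bit} / (T_{cell} + N_{bit}/f_{clock} + T_{set} + T_{vfy} \times N_{vfy})$, which is a reconstruction from the parameter definitions above; apart from the 8-KB parallelism and 10-µs cell-programming time quoted in this paper, every numeric value is a hypothetical placeholder.

```python
def programming_throughput(n_bit, t_cell, f_clock, t_set, t_vfy, n_vfy):
    """Throughput per the assumed form of (1).

    n_bit   : amount of data programmed at once (bytes here)
    f_clock : interface transfer rate in the same data unit per second
    times   : seconds
    """
    total_time = t_cell + n_bit / f_clock + t_set + t_vfy * n_vfy
    return n_bit / total_time


# Target from the introduction: 100 MB in 10 s.
required = 100e6 / 10.0  # bytes per second

# Hypothetical operating point (only n_bit and t_cell are quoted in the text).
pt_few_cycles = programming_throughput(n_bit=8 * 1024, t_cell=10e-6,
                                       f_clock=20e6, t_set=50e-6,
                                       t_vfy=20e-6, n_vfy=10)
pt_many_cycles = programming_throughput(n_bit=8 * 1024, t_cell=10e-6,
                                        f_clock=20e6, t_set=50e-6,
                                        t_vfy=20e-6, n_vfy=30)
print(f"required {required / 1e6:.1f} MB/s, "
      f"N_vfy=10 gives {pt_few_cycles / 1e6:.2f} MB/s, "
      f"N_vfy=30 gives {pt_many_cycles / 1e6:.2f} MB/s")
```

Under these assumed overheads, tripling the number of verification cycles pushes the throughput below the 10-MB/s target, which illustrates why a small $N_{vfy}$ matters as much as a large $N_{bit}$.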

Programming of a multilevel flash memory cell to the highest 
level requires a large $V_{th}$ shift of 4 V, which is about 1.5 times 
as great as the shift required in a two-level flash memory. 
High-speed cell programming, that is, a short $T_{cell}$, is thus essential. 

We can program many cells at a time if the current consumption 
of one memory cell during programming is small. Programming 
efficiency is the ratio of the injection current to the channel 
current (current drawn). If we are to further increase $N_{bit}$, we 
need to raise the programming efficiency. 
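
The link between programming efficiency and parallelism can be made concrete with a small sketch. It assumes a fixed total programming-current budget and a fixed injection current per cell; both numbers, and the function name, are hypothetical and serve only to show the scaling.

```python
def cells_in_parallel(i_budget_a, i_inj_a, efficiency):
    """Number of cells a current budget can program at once (as a float).

    Each cell draws a channel current of i_inj / efficiency, because
    efficiency = injection current / channel current.
    """
    return i_budget_a * efficiency / i_inj_a


I_BUDGET = 10e-3  # hypothetical total programming current available (A)
I_INJ = 1e-9      # hypothetical injection current needed per cell (A)

low_eff = cells_in_parallel(I_BUDGET, I_INJ, 1e-6)
high_eff = cells_in_parallel(I_BUDGET, I_INJ, 1e-3)
print(f"efficiency 1e-6: about {low_eff:.0f} cells; "
      f"efficiency 1e-3: about {high_eff:.0f} cells")
```

Under these assumptions, raising the efficiency by three orders of magnitude raises the achievable parallelism by the same factor, which is the scale of the difference between byte-level and kilobyte-level programming.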

In response to a single external program command, the mid-level 
$V_{th}$ distributions in a multilevel flash memory are sharpened 
by subjecting cells that fall outside the desired distributions 
to repeated cycles of internal programming and verification. 
A larger cell-to-cell variation in programming characteristics 
means a larger $N_{vfy}$ and correspondingly poorer programming 
performance. 

Thus, for a large $N_{bit}$, short $T_{cell}$, and small $N_{vfy}$, the 
cell-programming method must provide: 1) high-speed 
cell-programming; 2) high programming efficiency; and 3) 
highly uniform programming characteristics. However, no 
conventional programming method satisfies all of the above 
requirements. Table I gives a comparison of programming 
methods. 


A. Fowler—Nordheim (FN) Tunneling 


FN tunneling is used in the programming of conventional 
AND- [7] and NAND-type [8] multilevel flash memories. The 

TABLE I 
COMPARISON OF PROGRAMMING METHODS 

                      FN tunneling   CHE injection   SSI          CCIP 
Bias condition        (schematic drawings, not reproduced) 
Cell speed            >= 50 µs       ~10 µs          ~10 µs       ~10 µs 
Prog. efficiency      ~1             ~10^-6          ~10^-3       ~10^-3 
(Prog. parallelism)   (~kB)          (~byte)         (~kB)        (~kB) 
Distribution          ~2.5 V         ~1.5 V          ~4 V         ~1.5 V 








advantage of this method is its high programming efficiency, 
which allows programming parallelism on the kilobyte scale 
and increases overall programming throughput. However, FN 
tunneling requires cell-programming times of 50 µs and longer, 
as well as strong electric fields during programming. In addition, 
the programming characteristics (threshold voltage distributions) 
are not uniform because they are highly sensitive to 
certain device parameters, such as the gate-coupling ratio [9], 
[10]. As is shown in Table I, the $V_{th}$ distribution of memory 
cells programmed through FN tunneling with no internal 
reprogramming and verification is as wide as 2.5 V. Therefore, the long 
$T_{cell}$ and large $N_{vfy}$ limit the programming throughput. 


B. CHE Injection 


Channel-hot-electron (CHE) injection is realizable in simple 
stacked-gate devices and is thus widely used in the programming 
of NOR-type flash memories. This method achieves both 
high-speed cell-programming (10 µs) and high uniformity of 
programming, with a $V_{th}$ distribution of only about 1.5 V [11]. 
The major drawback of CHE injection is its low programming 
efficiency ($< 10^{-5}$), which is due to the incompatibility 
between the optimal conditions for high hot-carrier generation and 
for electron collection on the floating gate. Therefore, CHE 
injection cannot achieve sufficient programming throughput for 
media storage, because $N_{bit}$ in (1) is only several bytes. On the 
other hand, NOR-type flash memories are used mainly for code 
storage, and in this application the low programming throughput 
of CHE injection does not affect the performance. 


C. SSI 


Source-side channel-hot-electron injection (SSI) [12]-[15] 
is the most suitable method in terms of both fast cell-programming 
(~10 µs) and good programming parallelism (~kB). 
As the conditions for generating large numbers of hot carriers 
and strong injection can be made consistent, SSI programming 
achieves high programming efficiencies of more than $10^{-3}$. 
However, the problem with SSI is the large variation in 
programming characteristics ($V_{th}$ is distributed across more than 
4 V). To achieve high-speed programming in multilevel flash 
memories, this variation must be reduced. 


In this paper, we describe constant-charge-injection programming 
(CCIP), which realizes high-speed multilevel programming 
in flash memories. With CCIP, we achieve fast and precise 
control of $V_{th}$ by suppressing the characteristics variation of SSI 
programming. By utilizing CCIP, we obtained a short $T_{cell}$ of 
10 µs, large $N_{bit}$ of 8 KB, and small $V_{th}$ variation of 1.5 V. 
Furthermore, applying CCIP to AG-AND multilevel flash memory 
achieved a programming throughput above 10 MB/s. 

In Section II, we describe the mechanism of SSI and the 
problem with this method in terms of high-speed multilevel pro- 
gramming. Next, the concept of CCIP is presented in Section III. 
We then examine the application of CCIP to AG-AND flash 
memory in Section IV. The experimental results measured for a 
32-Mb test chip are given in Section V. In Section VI, we discuss 
potential problems of leakage current. Section VII presents our 
estimation of performance for a 1-Gb AG-AND flash memory
to which we apply CCIP. Finally, we conclude with a brief sum- 
mary in Section VIII. 


II. THE PROBLEM WITH SSI PROGRAMMING 


SSI programming realizes high programming efficiency and 
fast cell-programming. The large cell-to-cell variation in pro- 
gramming speed with SSI is, however, a problem. In this sec- 
tion, we discuss the mechanism of SSI and the problem of vari- 
ation in programming speed in terms of high-speed multilevel 
programming. 


A. High Programming Efficiency of SSI Programming 


SSI programming was developed as a way to obtain high pro- 
gramming efficiency. In the pioneering PACMOS (perpendicu- 
larly accelerating channel injection MOS) concept [12], a high 
potential at the floating gate is achieved by strong coupling with 
the drain. Since the potential of the floating gate can never be 
above that of the drain, conditions are not optimal for the col- 
lection of electrons on the floating gate. 

The split triple-gate concept [13]-[15] was developed as a
way to realize both the generation of large numbers of hot carriers
and strong injection. Fig. 1 is a schematic diagram of the
split triple-gate structure. An additional polysilicon select gate,
such as a sidewall gate, is placed on the source side of the
floating gate. Typical internal operating voltages are 17 V for
the control gate, 5 V for the drain, and 1.5 V for the select gate.
This programming bias condition creates a virtual drain, which
is an extension of the drain potential through the inversion layer
beneath the floating gate. As a result, a pinch-off condition appears
at the boundary between the select gate and the floating
gate, which enhances the generation of hot electrons. Some of
these hot electrons are injected into the floating gate by the vertical
electric field at the pinch-off point.

Fig. 1. Schematic view of split triple-gate flash memory programming.

Fig. 2. Dependence of $I_g$, $I_{ds}$, and injection efficiency on the voltage on the
select gate.

The dependence of channel current ($I_{ds}$) and injection current
($I_g$) on select gate bias is shown in Fig. 2. This was measured
for an AG-AND flash memory unit [16], an extension of the
split triple-gate structure. Further details are given in Section IV.
Achieving 10-µs cell-programming requires a large injection
current of more than 70 pA. $V_{sft}$, the $V_{th}$ shift of the memory
cell due to a single programming pulse, i.e., a single internal
programming operation, is given by


$$V_{sft} = \frac{Q_g}{C_{fg} \times R_c} = \frac{I_g \times T_{cell}}{C_{fg} \times R_c} \qquad (2)$$


where $Q_g$ is the total injection charge, $C_{fg}$ is the total capacitance
of the floating gate, and $R_c$ is the coupling ratio of the
control gate to the floating gate. In multilevel flash memory, a
large $V_{sft}$ of about 4 V is required. In this case, as $C_{fg}$ is about
0.3 fF and $R_c$ is about 0.6, a large $I_g$ of 70 pA is necessary to
achieve a short $T_{cell}$ of 10 µs.
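The arithmetic behind the 70-pA figure can be checked directly. The following is a minimal sketch that solves (2) for the injection current using only the values quoted in the text; the function and constant names are illustrative and do not come from the paper.

```python
# Worked check of Eq. (2): injection current needed for a target V_th shift.
C_FG = 0.3e-15   # total floating-gate capacitance (F), value quoted in the text
R_C = 0.6        # control-gate coupling ratio, value quoted in the text
V_SFT = 4.0      # required V_th shift per pulse (V), value quoted in the text
T_CELL = 10e-6   # target cell-programming time (s), value quoted in the text


def required_injection_current(v_sft, c_fg, r_c, t_cell):
    """Solve Eq. (2), V_sft = I_g * T_cell / (C_fg * R_c), for I_g."""
    return v_sft * c_fg * r_c / t_cell


i_g = required_injection_current(V_SFT, C_FG, R_C, T_CELL)
print(f"I_g = {i_g * 1e12:.0f} pA")  # prints "I_g = 72 pA"
```

The result of about 72 pA agrees with the requirement of more than 70 pA stated above.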





On the other hand, in order to achieve more than kilobyte 
parallel programming, $I_{ds}$ should be no more than 100 nA. This
is because current supply from the internal voltage source is 
limited to about 10 mA, due to restrictions on chip area and 
current consumption. 
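As a rough illustration of this current budget, the sketch below divides the supply limit by the per-cell channel current. The two-bits-per-cell figure is an assumption made here for a four-level multilevel cell; it is not stated at this point in the text.

```python
# Parallelism allowed by the internal supply, under stated assumptions.
I_SUPPLY_MAX = 10e-3   # internal voltage-source current limit (A), from the text
I_DS_MAX = 100e-9      # channel current per cell (A), from the text
BITS_PER_CELL = 2      # ASSUMPTION: four-level (2-bit) multilevel cell

max_cells = round(I_SUPPLY_MAX / I_DS_MAX)       # cells programmable at once
max_bytes = max_cells * BITS_PER_CELL // 8       # equivalent data width in bytes

print(f"{max_cells} cells, about {max_bytes / 1000:.0f} kB in parallel")
```

Under these assumptions the budget comfortably exceeds the kilobyte scale, which is the point the paragraph makes.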

As is shown in Fig. 2, both $I_g$ greater than 70 pA and $I_{ds}$ less
than 100 nA can be achieved together when the voltage of the select
gate is about 1.2 V. A high programming efficiency of more than
$3 \times 10^{-3}$ has thus been obtained; this is about three orders of
magnitude better than the value for a conventional stacked-gate 
structure. Therefore, by utilizing SSI programming with a split 
triple-gate structure, both fast cell-programming and program- 
ming parallelism above the kilobyte scale are accomplished in 
combination with low power consumption. 


B. Variation in Programming Speed 


Here, we show the problem with SSI programming, i.e., 
the variation in programming speed. As is shown in Fig. 2, 
achieving fast cell-programming with low channel current 
requires that the select gate be operated in the subthreshold 
region. So, $I_{ds}$ varies exponentially with linear variation in
the $V_{th}$ of MOS transistors formed under the select gate. This
variation in $I_{ds}$ leads to variation in programming speed. The
charge injected into the floating gate ($Q_g$) is expressed as


$$Q_g = \int_0^t \gamma \times I_{ds} \, dt \qquad (3)$$

where $\gamma$ is the programming efficiency. In (3), $I_{ds}$ is almost constant
during the programming pulse. If we define the average
programming efficiency during the whole period of programming
bias as $\gamma_1$, the expression for $Q_g$ can be rewritten as

$$Q_g = \gamma_1 \times I_{ds} \times t. \qquad (4)$$


The $V_{th}$ variation of select gate transistors is assumed to be
±0.2 V in 130-nm manufacturing processes. Therefore, $I_{ds}$
varies by more than two orders of magnitude, which produces a
large variation of programming speed. This variation increases
the number of internal programming and verification operations,
$N_{vfy}$, and degrades the programming performance. $N_{vfy}$
is expressed as





$$N_{vfy} \geq \frac{V_{dif}}{\Delta V_{th}} \qquad (5)$$

where $V_{dif}$ is the $V_{th}$ difference between the fastest cell and
the slowest cell that are programmed without verification. $\Delta V_{th}$
is the $V_{th}$ distribution that is intended after verification, which
is about 0.2 V. In multilevel flash memory, a sharp $V_{th}$ distribution
is formed by the repetition of both programming and verification.
So a large variation of programming characteristics increases
$N_{vfy}$ and degrades programming performance. For example,
when $V_{dif}$ is 4 V, $N_{vfy}$ is required to be as high as 20 for
every $V_{th}$ level. To reduce $N_{vfy}$, we have to decrease $V_{dif}$. The
target value for $V_{dif}$ is less than 1.5 V, which will reduce $N_{vfy}$
from 20 to 8. CCIP [17] has been developed as a method
to suppress the variation of SSI programming.

Fig. 3. Concept of CCIP. (a) Step 1. (b) Step 2. (c) Step 3.
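The two verification counts quoted for (5) follow from rounding the ratio up to a whole number of cycles. A minimal sketch, with a hypothetical helper name, is:

```python
import math


def verify_cycles(v_dif, delta_v_th):
    """Smallest integer N_vfy satisfying Eq. (5): N_vfy >= V_dif / dV_th."""
    # The small epsilon guards against floating-point rounding in the division.
    return math.ceil(v_dif / delta_v_th - 1e-9)


print(verify_cycles(4.0, 0.2))   # conventional SSI spread of 4 V
print(verify_cycles(1.5, 0.2))   # target spread of 1.5 V
```

With the 0.2-V target distribution, a 4-V spread needs 20 cycles and a 1.5-V spread needs 8, matching the figures in the text.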


III. CONSTANT-CHARGE-INJECTION PROGRAMMING


In conventional SSI programming, variation in $I_{ds}$ leads to
variation in programming characteristics. The essential point of 
CCIP is that the total amount of charge flowing through each 
memory cell in each programming operation is kept constant. 
This leads to the injection of constant charge into each floating 
gate. To obtain this constant flow of charge, each cell has to be 
equipped with a capacitor and switch. 

The concept of CCIP is shown in Fig. 3. The capacitor ($C_s$) is
attached between ground and the drain node of the memory cell.
The switch (SW) connects the drain node with VWD, which is
the internal power supply for the drain bias, $V_{wd}$. CCIP is performed
in three steps, with the aid of the capacitor and switch. In the
first step, the switch is turned on and the capacitor is connected
to VWD. The capacitor is then charged to $V_{wd}$, which is about
5 V. In the second step, which takes place when the voltage
across the capacitor has reached $V_{wd}$, the switch is turned off.
In the third step, the voltage on the select gate is raised to the
programming bias. The charge stored in the capacitor is then
discharged through the memory cell, generating hot electrons
which are injected into the floating gate. The total charge injected
into the floating gate ($Q_g$) is expressed as


$$Q_g = \int_0^{V_{wd}} \gamma \times C_s \, dV \qquad (6)$$


If we define the average programming efficiency across the
whole range of drain bias from $V_{wd}$ to 0 V as $\gamma_2$, (6) may be
rewritten as

$$Q_g = \gamma_2 \times C_s \times V_{wd} = \gamma_2 \times Q_s \qquad (7)$$

where $Q_s$ is the total charge stored in the capacitor. The
variation in the capacitance of $C_s$ is relatively
small (less than ±5% of $C_s$), so $Q_s$ can be seen as almost
constant. In addition, as is shown in Fig. 2, the variation in
$\gamma$ is about 0.2 of a decade under the condition that the $V_{th}$ variation
of the select gate transistor is ±0.2 V. Since $\gamma$ is much
less dependent on the select gate bias than $I_{ds}$, we can obtain a
near-constant $Q_g$. Therefore, CCIP realizes uniform programming
by suppressing the variation of programming speed in SSI
programming. In the next section, we discuss the application of
CCIP to 130-nm AG-AND flash memory.
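To make the comparison concrete, the sketch below contrasts the worst-case spread in injected charge for the two schemes, using only the spreads quoted in the text. It assumes that the efficiency and capacitor variations stack in the worst direction; the names are illustrative.

```python
def spread_ratio(decades):
    """Max-to-min ratio for a spread expressed in decades."""
    return 10.0 ** decades


IDS_SPREAD_DECADES = 2.0      # "more than two orders of magnitude" (lower bound)
GAMMA_SPREAD_DECADES = 0.2    # efficiency spread quoted from Fig. 2
CS_TOLERANCE = 0.05           # +/-5 % variation of the capacitor C_s

# Conventional SSI, Eq. (4): Q_g is proportional to I_ds, so its spread carries over.
ssi_charge_spread = spread_ratio(IDS_SPREAD_DECADES)

# CCIP, Eq. (7): Q_g = gamma_2 * C_s * V_wd, so only gamma and C_s contribute.
ccip_charge_spread = (spread_ratio(GAMMA_SPREAD_DECADES)
                      * (1 + CS_TOLERANCE) / (1 - CS_TOLERANCE))

print(f"conventional SSI: at least {ssi_charge_spread:.0f}x")
print(f"CCIP: about {ccip_charge_spread:.2f}x")
```

Under these assumptions the injected charge varies by a factor of at least 100 with conventional SSI but by less than a factor of 2 with CCIP.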


Fig. 4. Schematic view of AG-AND flash memory programming.


IV. APPLYING CCIP TO AG-AND FLASH MEMORY 


A. AG-AND Flash Memory 


Schematic diagrams of the memory cell and array architec- 
ture of AG-AND flash memory are given in Figs. 4 and 5. The 
assist gate (AG) is equivalent to the select gate of Fig. 1. The 
memory array is a virtual-ground structure and 256 memory 
cells are connected in parallel to the local bit-lines, each of 
which is a diffusion layer. 

Selection transistors control connection of the local bit-lines 
to the global bit-lines. One set of assist-gates (AGe) acts as the
program gates for the selected memory cells (A and C in Fig. 5) 
while the other set, AGo, acts as the field-isolation gates for the 
nonselected transistors (B in Fig. 5). The AG set to which the 
respective AG lines belong alternates across the structure, and 
the lines are joined up just beyond the ends of the local bit-lines. 
To reduce the data-line pitch, the floating gates were embedded 
in the spaces between the AGs by a self-aligned process. The 
floating gates have a three-dimensional shape, which enhances 
the coupling ratio with the word-lines. The unit cell area is 
0.104 µm², the data-line pitch is 0.4 µm, and the word-line pitch
is 0.26 µm.

Bias conditions for programming, erasure, and reading are 
listed in Table II. For erasure, a negative bias is applied to the 
selected word-line. Under this condition, electrons flow from 
the floating gates to the substrate by FN tunneling. 

The memory cell is programmed by source-side channel-hot-electron
injection for high programming efficiency. The internal
operating voltages are 13.5 V for the selected word-line (WL),
4.5 V for the drain, and 1.4 V for the selected AG. During programming
of cell A in Fig. 4, the AG of cell B (AGo) is kept at
0 V to suppress channel formation.

Fig. 5. Array architecture of AG-AND.

TABLE II
BIAS CONDITIONS FOR PROGRAMMING, ERASING, AND READING


B. Operation of CCIP 


As was described in Section III, realizing CCIP requires the 
addition of a capacitor and a switch to each of the selected 
memory cells, and this leads to a large increase in chip area. 
To achieve CCIP operation for an AG-AND flash memory
with no penalty in terms of chip area, we use the stray capacitance
of the diffusion local bit-line as the capacitor and the selection
transistor as the switch. The stray capacitance of the local
bit-line is 40 fF, which is largely composed of the capacitance
of the p-n junction. The $V_{th}$ shift for a memory cell in response
to a single programming pulse ($V_{sft}$) is expressed as


$$V_{sft} = \frac{Q_g}{C_{fg} \times R_c} = \frac{\gamma_2 \times C_s \times V_{wd}}{C_{fg} \times R_c} \qquad (8)$$


where $C_{fg}$ is the total capacitance of the floating gate and $R_c$ is
the coupling ratio of the control gate to the floating gate. As $C_{fg}$
is about 0.3 fF and $R_c$ is about 0.6, $V_{sft}$, the change in threshold
voltage with a single programming pulse, is about 3.0 V.
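The average efficiency is not given numerically at this point, but (8) can be inverted to see what value the quoted 3.0-V shift implies. The sketch below does this with the capacitance and bias values from the text; the result is a back-calculated illustration, not a measured value, and the variable names are hypothetical.

```python
# Back-solve Eq. (8) for the average efficiency implied by a 3.0-V shift.
C_S = 40e-15     # local bit-line stray capacitance (F), from the text
V_WD = 4.5       # bit-line precharge (drain) voltage (V), from the text
C_FG = 0.3e-15   # total floating-gate capacitance (F), from the text
R_C = 0.6        # coupling ratio, from the text
V_SFT = 3.0      # V_th shift per programming pulse (V), from the text

q_s = C_S * V_WD                       # charge stored on the bit-line
gamma_2 = V_SFT * C_FG * R_C / q_s     # efficiency implied by Eq. (8)
q_g = gamma_2 * q_s                    # charge injected into the floating gate

print(f"Q_s = {q_s * 1e15:.0f} fC, implied gamma_2 = {gamma_2:.1e}, "
      f"Q_g = {q_g * 1e15:.2f} fC")
```

With 180 fC stored on the bit-line, an efficiency of about 3e-3 is enough to inject the roughly 0.54 fC needed for a 3.0-V shift.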

The timing diagram of CCIP is shown in Fig. 6, which applies
to programming of the hatched cells in Fig. 5. In the first
step (at t2), the gate signal of the relevant selection transistors
(STDo) becomes high and the local bit-lines (LBL$_{2n}$ and
LBL$_{2n+2}$) are charged to 4.5 V. After charging of the local
bit-lines is completed (at t3), the selection transistors are turned
off. Local bit-lines LBL$_{2n}$ and LBL$_{2n+2}$ are then floating. Finally,
when AGo becomes high at t4, the stored charge in LBL$_{2n}$
and LBL$_{2n+2}$ is discharged through cell A and cell C. The pulse
width (tPULSE) must be long enough for the slowest cell to discharge
all of its stored charge.


V. EXPERIMENTAL RESULTS 


A 32-Mb AG-AND test chip was fabricated in 0.13-µm
CMOS technology and is shown in Fig. 7. Key device characteristics
and parameters are summarized in Table III. A
triple-well CMOS process on a p-type substrate was used.
The tunnel oxide of the memory cells is 9 nm thick and the
gate-oxide layers of the high- and low-voltage peripheral transistors
are 25 nm and 9 nm thick, respectively. The word-line
pitch is 0.26 µm and the bit-line pitch is 0.4 µm. The bit area
of the cell is 3.1 F², for a value of 0.052 µm² with the 0.13-µm
process. Key points from the results of measurement of this test
chip are given below.

TABLE III
DEVICE FEATURES

Process              : 0.13-µm p-sub CMOS triple-well,
                       2 poly-Si, 1 W, 2 Al
Gate oxide           : 25 nm (H.V.), 9 nm (L.V.)
Tunnel oxide         : 9 nm
Interpoly dielectric : 14 nm
Cell size            : 0.052 µm²/bit

Fig. 8. Distributions of programming characteristics. (Vertical axis: number of memory cells; horizontal axis: threshold voltage (V).)
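The cell-size figures quoted above are mutually consistent, as the following short check shows. The two-bits-per-cell value is inferred here from the ratio of the unit cell area to the per-bit area; it is an assumption of this sketch.

```python
# Consistency check of the quoted cell dimensions.
F_UM = 0.13          # feature size (µm), from the text
WL_PITCH_UM = 0.26   # word-line pitch (µm), from the text
BL_PITCH_UM = 0.4    # bit-line (data-line) pitch (µm), from the text
BITS_PER_CELL = 2    # ASSUMPTION: inferred from 0.104 µm² per cell vs 0.052 µm² per bit

unit_cell_area = WL_PITCH_UM * BL_PITCH_UM     # µm² per cell
bit_area = unit_cell_area / BITS_PER_CELL      # µm² per bit
bit_area_in_f2 = bit_area / F_UM ** 2          # area in units of F²

print(f"{unit_cell_area:.3f} µm²/cell, {bit_area:.3f} µm²/bit, "
      f"{bit_area_in_f2:.1f} F²/bit")
```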

A comparison of the $V_{th}$ distributions for conventional SSI
programming and CCIP is given in Fig. 8. The number of measured
cells is equivalent to 4 Mb. As is shown in Fig. 8, the $V_{th}$
distribution with conventional SSI programming spans a broad
5.5 V ($V_{th}$: 1.0-6.5 V). By utilizing CCIP, however, we dramatically
narrow the $V_{th}$ distribution to span less than 1.5 V ($V_{th}$:
3.5-5.0 V).

Figs. 9 and 10 show the programming characteristics. The
X-axis indicates the number of internal programming pulses.
The length of each pulse is 1 µs. These figures indicate that
controlling the word-line voltage ($V_{ww}$) and the drain voltage
($V_{wd}$) are both effective as ways to optimize the programming
speed.


VI. CHARGE LEAKAGE 


This section covers potential problems of leakage that accompany
the proposed scheme. When the local bit-line is precharged
to the programming voltage, we see two kinds of charge
leakage from the floating drain node. The first is p-n junction
leakage and the second is gate-induced-drain leakage. Since
charge leakage reduces $Q_g$ in (4), it also lowers the programming
speed.


A. Junction Leakage 


Since the storage capacitor is a p-n junction, it has p-n
junction leakage. This leakage is determined by the breakdown
voltage of the p-n junction (BVj). Fig. 11 shows how BVj affects
the dependence of the programming characteristics (drop
in $V_{th}$) on tFLOAT, which is the period over which the local
bit-line floats, as shown in Fig. 6. With increasing tFLOAT, the
leakage current increasingly lowers $V_{th}$, so that more rounds of
reprogramming and verification are required, leading to lower
programming speeds. The results indicate that increasing BVj
is highly effective as a way of suppressing the programming
degradation.

Fig. 11. Effect of p-n junction leakage.


B. Gate-Induced-Drain Leakage (GIDL) 


GIDL is caused by band-to-band tunneling in the gate-overlap
region of the drain. GIDL becomes large when $V_{th}$ is high
for the nonselected memory cells in the columns parallel
to the selected memory cells (Fig. 12). Therefore, narrowing
the multilevel $V_{th}$ window is effective in suppressing the programming
degradation.


VII. PERFORMANCE OF 1-Gb AG-AND FLASH 


In this section, we present estimates of the programming per- 
formance of 1-Gb multilevel AG-AND flash memory units with 
SSI and CCIP programming. 

Fig. 13 compares programming times. When conventional
SSI programming is applied, the deviation in programming
characteristics increases the number of internal programming
and verification cycles ($N_{vfy}$) so that this process alone requires
almost 1 ms. CCIP decreases $N_{vfy}$ by lowering the variation of
threshold voltage relative to that seen in SSI programming. The
time overhead of verification is reduced to 45% of the value
for SSI. Given 8-kB-parallel programming, a programming
throughput of 10.3 MB/s (i.e., 1.8 times faster than with SSI)
is achieved.
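The quoted throughput can be turned into a per-page programming time. The sketch below assumes that 8 kB means 8192 bytes and that 1 MB/s means 10^6 bytes per second; neither convention is stated in the text.

```python
# Per-page programming time implied by the quoted throughput figures.
PAGE_BYTES = 8 * 1024        # ASSUMPTION: 8 kB = 8192 bytes
THROUGHPUT_CCIP = 10.3e6     # bytes/s; ASSUMPTION: 1 MB = 10**6 bytes
SPEEDUP_OVER_SSI = 1.8       # from the text

t_page_ccip = PAGE_BYTES / THROUGHPUT_CCIP            # seconds per 8-kB page
throughput_ssi = THROUGHPUT_CCIP / SPEEDUP_OVER_SSI   # bytes/s with SSI
t_page_ssi = PAGE_BYTES / throughput_ssi              # seconds per 8-kB page

print(f"CCIP: {t_page_ccip * 1e6:.0f} us per page")
print(f"SSI : {t_page_ssi * 1e6:.0f} us per page "
      f"({throughput_ssi / 1e6:.1f} MB/s)")
```

Under these assumptions a page takes roughly 0.8 ms with CCIP and about 1.4 ms with SSI, which is consistent with verification alone taking almost 1 ms in the SSI case.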

In addition to high-speed programming and uniformity of 
programming characteristics, CCIP has the advantage of a lower 
current requirement for programming than SSI (Fig. 14). In 
SSI programming, channel current flows through the memory 
cells during the entire programming period. So, in the case of 
8-kB parallel programming, a current source providing more 
than 10 mA is required for VWD. With CCIP, however, the internal voltage
source only has to precharge the bit-line capacitance.
That is, the voltage source only has to supply 35% of the current
required with SSI programming.


VIII. CONCLUSION


Constant-charge-injection programming (CCIP) has been 
proposed as a method for the high-speed programming of 
multilevel flash memories. As a replacement for conventional 
FN, CCIP based on SSI is a key technology for high-speed 
multilevel programming. By utilizing CCIP, we obtained 
high-speed cell programming of 10 µs, high programming
efficiency of more than $3 \times 10^{-3}$, and high uniformity of
programming, with a $V_{th}$ distribution of 1.5 V. AG-AND
flash memory with the proposed scheme enables 10.3 MB/s 
multilevel programming. 


A 1-GHz Signal Bandwidth 6-bit CMOS ADC With Power-Efficient Averaging


Xicheng Jiang and Mau-Chung Frank Chang, Fellow, IEEE 


Abstract—A 2-GS/s 6-bit ADC with time-interleaving is demonstrated
in 0.18-µm one-poly six-metal CMOS. A triple-cross
connection method is devised to improve the offset averaging
efficiency. Circuit techniques, enabling a state-of-the-art
figure-of-merit of 3.5 pJ per conversion step, are discussed. The 
peak DNL and INL are measured as 0.32 LSB and 0.5 LSB, 
respectively. The SNDR and SFDR have achieved 36 and 48 dB, 
respectively, with 4 MHz input signal. Near Nyquist input fre- 
quencies, the SNDR and SFDR maintain above 30 and 35.5 dB, 
respectively, up to 941 MHz. The complete ADC, including 
front-end track-and-hold amplifiers and clock buffers, consumes 
310 mW from a 1.8-V supply while operating at 2-GHz conversion 
rate. The prototype ADC occupies an active chip area of 0.5 mm².


Index Terms—Analog-to-digital converter (ADC), averaging, 
CMOS, interleaving, track/hold, triple-cross connection. 


I. INTRODUCTION 


HIGH-SPEED ADCs are an integral part of high-performance
systems such as disk drive read channels, fiber
optical receiver front-end and data communication links using 
multilevel signaling (e.g., PAM and QAM). The main issues 
in the design of such ADCs include static and dynamic offset 
reduction, low supply-voltage operation, gain and speed op- 
timization. Design tradeoffs between power, speed, and chip 
area further tighten the design requirements. It is also of partic- 
ular importance that such ADCs be implemented in a standard 
CMOS process for easy integration with larger signal processing 
circuits. 

This paper presents the design of a 6-bit 2-GS/s ADC implemented
in a 0.18-µm CMOS technology. The ADC performance
in a standard CMOS process is constrained by the threshold 
mismatch of the CMOS devices. The offset averaging method 
proposed in [1] is a powerful technique to alleviate its impact 
in preamplifier or comparator arrays [1]-[4]. Nonetheless, it 
still requires further modifications to correct for optimum av- 
eraging effects at the array boundaries. This work introduces 
a triple-cross connection method to improve the averaging ef- 
ficiency. Combining such a technique with time-interleaving 
and open-loop front-end track-and-hold amplifiers (THAs), the 
converter achieves a figure-of-merit of 3.5 pJ per conversion 
step. Section II introduces the triple-cross connection method. 
Section III describes the ADC architecture and the THA circuit. 
Section IV presents experimental results obtained from the pro- 
totype ADC. 

II. TRIPLE-CROSS CONNECTION METHOD


A. Boundary Issues 


Averaging acts like a spatial filter that can reduce the offset 
of the preamplifier. Since it smoothes out the faster fluctuation 
more than the slower one, the differential nonlinearity (DNL) 
usually gets more improvement than the integral nonlinearity 
(INL). One way to implement averaging is by inserting ladder 
resistors between outputs of adjacent amplifiers [1]. The av- 
eraging technique, however, causes problems at the averaging 
network boundaries. In general, there are two issues with tradi- 
tional averaging networks at the boundaries. First, zero-cross- 
ings drift from input reference voltage levels due to the asym- 
metrical nature of the boundary. At the edge, the zero-crossings 
shift inward due to the lack of amplifiers on the other side. This 
drift causes systematic nonlinearity errors. Second, the number 
of random components contributing to the averaging is dimin- 
ished at the boundary. This counteracts the resulting DNL/INL 
improvement through the averaging. In other words, the standard
deviation of the input-referred offset at the boundary is
larger than at the center. Compared with the amplifier array
center, the input linear range of the preamplifier at the network edge
covers only about half the number of preamplifiers that can contribute
to averaging. State-of-the-art designs use either dummy
amplifiers to preserve the characteristics of an infinitely long 
amplifier array [2], [3], or resort to the extra boundary termi- 
nation circuits to suppress the zero-crossing shifts at the edges 
[4]. The dummy method can be made more effective when more 
dummies are used. For instance, 18 dummies are required for an 
averaging window that covers 18 amplifiers [3]. However, this 
makes the averaging method rather inefficient, since only a part 
of the amplifier array and the reference range are usable. The 
edge termination method consumes less power and area. How- 
ever, it only restores the systematic errors when the averaging 
window is narrow and the boundary issue is less severe. Furthermore,
these methods need a significant amount of extra reference
range, which represents a serious challenge for low-supply
applications.
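The second boundary issue can be illustrated with a deliberately simplified statistical model. The sketch below assumes that averaging takes an equal-weight mean of independent, identically distributed offsets, which is an assumption of this example; an actual resistor network weights neighboring amplifiers unequally.

```python
import math


def averaged_sigma(sigma_single, contributors):
    """Std. dev. of the mean of `contributors` independent, equal-sigma offsets."""
    return sigma_single / math.sqrt(contributors)


WINDOW = 18                                      # averaging window quoted for [3]
sigma_center = averaged_sigma(1.0, WINDOW)       # full window at the array center
sigma_edge = averaged_sigma(1.0, WINDOW // 2)    # about half the contributors at the edge
edge_penalty = sigma_edge / sigma_center         # how much worse the edge is

print(f"center: {sigma_center:.3f}, edge: {sigma_edge:.3f}, "
      f"penalty: {edge_penalty:.2f}x")
```

In this toy model, halving the number of contributing amplifiers raises the residual offset at the edge by a factor of the square root of 2, which is the qualitative effect described above.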


B. Triple-Cross Connection 


To solve the boundary problem, the first step is to make sure
the averaging resistor network is properly terminated. One way
to achieve this goal, as suggested in most folding ADCs with resistive
interpolation, is to cross-connect outputs at the network
boundaries. This preserves the translational symmetry of the impulse
response [5] of the resistor network, but the primary issues
such as zero-crossing shifts and uneven averaging remain. The
clipped outputs at the other boundary provide a strong force
pulling the zero-crossings outward, far away from their ideal
positions. These clipped amplifiers will not contribute threshold
mismatch components [6] from their input differential pairs to
averaging. Only a part of the array has the required linearity and
both edges exhibit significant distortion, as shown in Fig. 1.

Fig. 1. (a) Preamplifier array with one cross connection at the boundary.
(b) Preamplifiers from one side contribute to averaging at the edge. (c) INL
profile.

Let us start with adding enough dummies at both boundaries. The
over-range references can be eliminated after observing the symmetry
property of differential circuits. By cross-connecting outputs
of the dummies to that of regular amplifiers, the dummies
can be connected oppositely to existing reference points instead
of over-range references, as illustrated in Fig. 2. By doing so,
we have achieved the following goals: 1) the extra references
are eliminated; 2) the zero-crossing shifts are corrected due to
symmetry being maintained; and 3) the input linear range at the
boundary covers an equal number of amplifiers at the edge and
at the center, which means the random offsets are averaged in
the same scale from the array center to the boundary. However,
the negative transconductances from the dummies reduce the
effective transconductance at the boundary. Also, when the averaging
window is wide, a significant number of dummies are
required.

This method can be further improved by designing
an interface amplifier instead of using the regular preamplifiers
as dummies. Like the regular preamplifier, the interface amplifier
consists of an input differential pair, a current source,
and resistive loads. The differential input devices are carefully
sized such that the input linear region of the interface amplifier
overlaps that of the adjacent regular preamplifier. The interface
amplifier in Fig. 3 has a similar effect to the lumped effect
of the dummy array with respect to averaging. To minimize
zero-crossing shifts and the negative transconductance, the reference
point used for the interface amplifier is three steps away
from the end of the reference ladder. There is one interface amplifier
at each network boundary. Fig. 4 shows the scheme of the
triple-cross connection method. The two crossings at the boundaries
minimize the zero-crossing shifts and the third crossing is
for proper termination of the resistor network. Simulations indicate
that the peak INL of the amplifier array can be reduced
to 0.5 LSB by using the triple-cross connection method and interface
amplifiers (down from 4.5 LSB with abrupt termination
and from 7.2 LSB with only one cross-connection at the averaging
network edges).

Fig. 3. Interface amplifier equivalent to the dummy array.

Fig. 4. Triple-cross connection scheme.


III. ADC ARCHITECTURE AND THA 
A. Proposed ADC Architecture 


In order to achieve the required data throughputs, time inter- 
leaving [7] is needed. It is used to relax the bandwidth require- 
ments of individual ADC blocks (except for the THA, which 
still requires the full tracking bandwidth). This leads to higher 
ADC data throughput at a lower clock rate with reduced overall 
power consumption. 

An open-loop THA with replica biasing is implemented to 
ensure the desired dynamic performance with broadband input 
signals. Interpolation is implemented at the comparator stage to 
save hardware and power consumption in preamplifiers. A cur- 
rent-mode logic (CML) comparator latch is used to lower the 
dynamic offset of the combined stage. The latch output swing is 
limited to 0.6 V (rather than 1.8 V rail-to-rail) to speed up the re- 
generation and reduce the dynamic offset. The gain required for 
suppressing the dynamic offset of the comparator is distributed 
among preamplifier stages to maximize the overall circuit band- 
width. The front-end THA decreases the bubble error proba- 
bility arising from the clock skew. However, high-speed glitches 
remain a main source for bubble errors. The signal-to-noise 
ratio (SNR) drops and the output waveform is severely dis- 
torted due to these performance limiting glitches. A 3-input 
NAND following the comparators is used as the power-efficient 
error-reduction circuitry. A ROM-based encoder maps the ther- 
mometer code to the binary code. The detailed block diagram 
of the time-interleaving ADC with averaging and interpolation 
is shown in Fig. 5. The analog signal paths use fully differential 
circuits. 
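The combination of 3-input error reduction and a ROM encoder can be sketched in software. The following is a behavioral illustration only: it assumes a commonly used arrangement in which a level is selected when the two comparators below it read 1 and the comparator at that level reads 0, and it models the ROM as a lookup whose selected rows are OR-ed together. The exact gate connections in the chip are not described in the text, and the function name is hypothetical.

```python
def bubble_tolerant_encode(thermo):
    """Encode a thermometer code (index 0 = lowest comparator) to an integer.

    Level k is selected only when the two comparators below it read 1 and
    the comparator at k reads 0, i.e. a 3-input detection per level.
    """
    n = len(thermo)
    # Virtual comparators: two always-1 below the array, one always-0 above it.
    padded = [1, 1] + list(thermo) + [0]
    select = [padded[k] & padded[k + 1] & (1 - padded[k + 2])
              for k in range(n + 1)]
    # ROM stand-in: row k stores the value k; all selected rows are OR-ed.
    code = 0
    for k, selected in enumerate(select):
        if selected:
            code |= k
    return code


print(bubble_tolerant_encode([1, 1, 1, 0, 0, 0, 0]))  # clean code
print(bubble_tolerant_encode([1, 1, 0, 1, 0, 0, 0]))  # code with one bubble
```

For the bubbled input, a plain 2-input transition detector would select levels 2 and 4 at once and the OR-ed ROM output would read 6; the 3-input detection selects only level 2, so the error stays within one LSB of the neighboring codes.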


B. Track/Hold Amplifier 


At gigahertz sampling frequencies, the THA [8] is critical for 
achieving good dynamic performance over broadband input sig- 
nals. Fig. 6 shows an open-loop THA with replica-based “well- 
biasing.” Source followers in the THA utilize sufficiently large 
PMOS devices to drive subsequent preamplifiers. The output 
of a small replica source follower is used to bias the well of 
the main source follower. This has linearity advantages over a 
source follower with a well-to-source connection, without the 
disadvantage of having that output drive the nonlinear well- 
substrate capacitance. The replica consumes only 5% of the 
power of the main source follower. The low input common-mode
voltage reduces the on-resistance of the NMOS switches
and increases the input tracking bandwidth (−1 dB) to about
6.4 GHz. The dummy switches reduce the charge injection and 
the voltage glitch, thus reducing the dynamic offset. 


IV. EXPERIMENTAL RESULTS 


Fabricated in a 0.18-jzm one-poly six-metal (1P6M) CMOS 
technology, the chip microphotograph is shown in Fig. 7. The 
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CFO EATEN RIL Fae 





Fig. 7. Microphotograph of the fabricated 6-bit ADC. 


right side contains a test structure and the left side contains the 
2-GS/s 6-bit ADC. Two sub-ADCs are laid at the top and the 
bottom, respectively, with the clock generator and the buffer 
amplifier sitting at the center. For each of the sub-ADCs, from 
the left to the right, are amplifiers and digital encoders. The 
prototype ADC occupies an active chip area of 0.5 mm?. A 
decoupling capacitor of about 1 nF is used to fill the empty 
space on the die. For easy testing, the ADC chip is mounted 
on a printed circuit board (PCB) with direct die-to-board wire 
bonding. There is no decimation at the ADC outputs. For dy- 
namic analyses, the outputs from the two ADCs are combined 
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Fig. 9. Measured frequency spectrum. 


to deliver the full 2-Gword/s rate. However, for static analyses, 
outputs from the two ADCs are separated to avoid numerical 
avetaging. Fig. 8 shows the measured INL and DNL profiles. 
They are extracted from the histogram [9] of the 64 K ADC 
outputs in response to a 200-kHz sinusoidal input signal at the 
sampling rate of 2 GHz. The peak INL and DNL are recorded 
as 0.5 LSB and 0.32 LSB, respectively. This plot shows the 
systematic nonlinearity being corrected. When the input signal 
frequency increases, the peak INL and DNL remain nearly un- 
changed until near the Nyquist input frequencies. The linearity 
is then dominated by the front-end pseudodifferential THAs. 
The dynamic performance of the converter is validated in the 
frequency domain as well. The frequency spectrum of the re- 
constructed signal is shown in Fig. 9, where the input signal fre- 
quency is about 941 MHz and the clock frequency is 2 GHz. The 
0.5 fs — fin tone is about 50 dB down, which implies the gain 
and timing errors between interleaved channels do not limit the 
linearity performance of the overall ADC system. The dominant 
harmonics (second and third) are contributed by the front-end 
pseudodifferential THAs. Fig. 10 depicts the measured spurious 
free dynamic range (SFDR) and signal-to-noise-and-distortion 
ratio (SNDR) versus the input signal frequency at 2-GHz sam- 
pling rate. At the low input frequency of 4 MHz, the SNDR and 
SFDR reach 36 and 48 dB, respectively. Near the Nyquist input 
frequencies (up to 941 MHz), the measured SNDR and SFDR 
remain above 30 and 35.5 dB. The analog input range is set 
to 1.0-V peak-to-peak differential. The input capacitance of the 
ADC is about 1 pF. Including the front-end THAs and on-chip
Fig. 10. Measured SNDR and SFDR as a function of input frequency.


clock buffers, the complete ADC consumes 310 mW of power 
from a single 1.8-V supply, while operating at 2-GHz conver- 
sion rate with input signal frequency up to 996 MHz. 


V. CONCLUSION 


A 2-GS/s 6-bit ADC with time-interleaving is demonstrated
in a 0.18-μm 1P6M CMOS technology. A triple-cross connection
method is invented to improve the offset averaging efficiency.
Open-loop THAs with replica-based well-biasing are realized to
ensure the dynamic performance up to Nyquist frequencies. This
ADC is optimized to achieve a state-of-the-art figure-of-merit,
defined as $\mathrm{Power}/(2^{\mathrm{ENOB}} \cdot 2 \cdot \mathrm{ERBW})$, of
3.5 pJ per conversion step.
A sinh Resistor and Its Application to tanh Linearization


Maziar Tavakoli, Student Member, IEEE, and Rahul Sarpeshkar, Member, IEEE 


Abstract—We present a novel and simple subthreshold tunable 
resistor (sinh R) which exhibits a sinh I-V characteristic. This 
compact 8-transistor circuit generates an output current that is 
proportional to the sinh of its input differential voltage and has 
an offset-free characteristic, i.e., zero current at zero differential 
voltage, like a real resistor. In a 1.5-μm CMOS chip implementation,
we achieved a common-mode rejection ratio (CMRR) of
46 dB. As an example application, we use the expansive properties 
of our sinh R to linearize the compressive properties of a tanh 
differential pair by degeneration and cancel all nonlinearities up 
to fifth order. We demonstrate good agreement between theory and 
experimental results. 


Index Terms—Distortion, filter, linearization techniques, sinh 
resistor, subthreshold operation, tanh differential pair. 


I. INTRODUCTION 


RESISTORS with a sinh I-V characteristic could be useful
in various nonlinear dynamical systems. For example,
they can be used to implement attack times in automatic gain 
control circuits that quicken for larger input transients. To the 
best of our knowledge, a transistor-level implementation of a 
tunable sinh resistor has never been reported in the literature. 
We present a compact circuit that generates an output current 
proportional to the sinh of its input differential voltage. 

Differential transconductors are essential elements in many 
analog electronic systems, such as filters, amplifiers, mixers, 
oscillators, and signal processing systems. Subthreshold differ- 
ential pairs are attractive because of their low power consump- 
tion, large tuning range, and low transconductance, which allow 
them to efficiently implement low-frequency continuous-time 
filters; for example, in the audio range (20 Hz–20 kHz). Other
applications for subthreshold circuits include biomedical im- 
plants, sensors and sensory networks, earthquake and vibra- 
tion sensing, and low-power analog-to-digital (A/D) conversion. 
Since MOS transistors operating below threshold show an exponential
I-V property, basic subthreshold differential pairs, like
their bipolar counterparts, suffer from limited linear range and
harmonic distortion produced by their tanh I-V transfer
characteristic [1]-[3].

At the expense of a modest increase in area and power 
consumption, several linearizing schemes have been suggested 
in the literature to extend the input linear range of exponen- 
tial (subthreshold MOS, bipolar) differential pairs, including 
source (emitter) degeneration via resistors [4], degeneration 
via diode-connected transistors [3], source degeneration via 
single or double diffusors (MOS transistors operating in the 
subthreshold ohmic region) [5], multiple parallel asymmetric 
differential pairs [6], [7], application of the input signal to the
back-gate (well) terminals [1], gate degeneration [1], the use 
of a correlator or bump circuit [1], [8], or the combination of 
some of the above [9] or other [10]-[12] techniques. A larger 
linear range for a transconductor in thermal-noise limited cases 
translates to a rise in the dynamic range of filters built with 
such transconductors [1]. In this brief, we discuss how to use 
a sinh resistor to linearize a subthreshold differential pair by 
counteracting the compressive properties of a tanh with the 
expansive properties of a sinh to obtain a curve that is more 
linear than a tanh. 

The outline of this brief is as follows. In Section II, we present 
the basic idea and the design of the sinh resistor and show data 
taken from a chip. We describe the implementation and the ex- 
perimental results of a sinh-linearized tanh differential pair in 
Section III. We summarize in Section IV. 


II. sinh RESISTOR (sinh R) 
A. Basic Idea 


Fig. 1(a) shows a two-port element with an expansive I-V
characteristic. It is composed of a MOS transistor whose drain
voltage is shifted up by $V_o$ and coupled back to its gate terminal.
To intuitively understand the operation of this element, we recognize
that with a zero $V_o$, this element essentially acts like a
diode. This means that when its voltage ($V$) is increased, its
current ($I$) rises either in an exponential or a square-law manner
(depending on its regime of operation), both of which are expansive.
A tunable $V_o$ allows for a tunable slope at the origin. The
I-V curve is offset-free because zero drain-to-source voltage 
across a transistor always yields zero current. A sinh curve 
also has an (exponential) expansive quality, possesses a nonzero 
slope at the origin, and passes through the origin. 

The latter two properties are also observed in the linear resistor
of Fig. 1(b) and also in the compressive two-port element
of Fig. 1(c), in which the gate-to-source voltage ($V_{GS}$)
of a MOS transistor has been fixed at $V_o$. When the voltage
across this element ($V_{DS} = V$) increases from zero, the current
through this element ($I$) rises from zero, gradually flattens
out (a compressive quality), and approaches its saturation value
($I_{DSsat}$) in an exponential or second-order fashion (based on
its operation regime).

The element of Fig. 1(a) constitutes the basic core of our sinh 
resistor circuit. One problem with this element is that, unlike a 
typical resistor, it cannot function in a bi-directional way. One 
easy solution to implement a bi-directional sinh resistor is to 
place two expansive elements of Fig. 1(a) in parallel and oppo- 
site directions, as demonstrated in Fig. 2(a). However, a more el- 
egant way to achieve bi-directionality, which shares bias voltage 
sources, is illustrated in Fig. 2(b). A single bias circuit deter- 
mines which side is the drain and which side is the source and 
Fig. 2. Two different implementations of a bi-directional sinh resistor. (a) By the use of two expansive resistors in parallel and opposite directions. (b) By the use of a single bias circuit with source and drain inputs.


puts the appropriate voltage on the gate. If $V_g$ is determined by
the maximum of $V_1$ and $V_2$ (i.e., the drain), a sinh resistor is
obtained. Similarly, if the minimum of $V_1$ and $V_2$ (i.e., the source)
sets $V_g$, a tanh resistor is attained. Since tanh properties have
been extensively realized by other methods in circuit design, the 
focus of the remainder of this brief will be on the sinh resistor. 


B. Circuit Implementation 


Fig. 3(a) shows the circuit schematic of a maximum circuit 
that can function as the bias circuit required in Fig. 2(b) to re- 
alize a sinh resistor. To analyze this circuit, we note that the 
current in a subthreshold MOS transistor is given by [2], [13] 


$$I_{DS} = I_o\, e^{\kappa V_{GB}/U_T}\left(e^{-V_{SB}/U_T} - e^{-V_{DB}/U_T}\right) \qquad (1)$$

In saturation ($V_{DS} > 5U_T$):

$$I_{DS} = I_o\, e^{(\kappa V_{GB} - V_{SB})/U_T} \qquad (2)$$

where $V_{GB}$, $V_{SB}$, and $V_{DB}$ are the gate-to-body, source-to-body,
and drain-to-body voltages, respectively; $\kappa$ is the subthreshold
exponential coefficient; $I_o$ is the subthreshold current-scaling
parameter; and $U_T = kT/q$ is the thermal voltage (about 25.9 mV at
room temperature). In the simple model of (1), the effect of a nonzero
drain-to-source conductance on $I_{DS}$ has been ignored.

In the circuit of Fig. 3(a), if we ignore the output resistance
of the top $T_3$-$T_4$ pMOS current mirror, we can write (note that
the bodies of all nMOS transistors in our n-well process are tied
to the substrate, which is connected to ground; i.e., $V_B = 0$ V)

$$I_{DS1} + I_{DS2} = I_{DS5} = \frac{I_b}{2}$$
$$\Rightarrow\; e^{\kappa V_1/U_T} + e^{\kappa V_2/U_T} = e^{\kappa V_{out}/U_T}$$
$$\Rightarrow\; V_{out} = \frac{U_T}{\kappa}\ln\left(e^{\kappa V_1/U_T} + e^{\kappa V_2/U_T}\right). \qquad (3)$$

If one of the input signals is much larger than the other one,
(3) simplifies to $V_{out} = \max(V_1, V_2)$. In a similar approach, a
pMOS version of this maximum circuit forms a minimum circuit
that can be used to create a tanh resistor with the topology
of Fig. 2(b).
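
As a numerical illustration of how (3) approaches a hard maximum, the following minimal sketch evaluates the log-sum-exp form of (3) directly. The function name and the value of kappa (0.7) are assumptions chosen only for illustration; they are not taken from the measured chip.

```python
import math

U_T = 0.0259  # thermal voltage in volts, as quoted in the text


def max_circuit_vout(v1, v2, kappa=0.7, u_t=U_T):
    """Evaluate (3): V_out = (U_T/kappa) * ln(exp(kappa*V1/U_T) + exp(kappa*V2/U_T)).

    kappa = 0.7 is an assumed, illustrative value.
    """
    a = kappa * v1 / u_t
    b = kappa * v2 / u_t
    m = max(a, b)  # factor out the larger exponent to avoid overflow
    return (u_t / kappa) * (m + math.log(math.exp(a - m) + math.exp(b - m)))


if __name__ == "__main__":
    # Widely separated inputs give essentially max(V1, V2); equal inputs
    # give V + (U_T/kappa)*ln(2), the largest deviation from a hard maximum.
    for v1, v2 in [(2.0, 2.5), (2.45, 2.5), (2.5, 2.5)]:
        print(v1, v2, max_circuit_vout(v1, v2))
```

The deviation from a true maximum is bounded by $(U_T/\kappa)\ln 2$, which occurs when the two inputs are equal.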

Fig. 3(b) illustrates the circuit schematic of our sinh resistor
(sinh R) based on the implementation idea of Fig. 2(b) which
employs the maximum circuit of Fig. 3(a). The voltage $V$ is
equal to the maximum of $V_1$ and $V_2$, that is, the drain side of the
main sinh transistor $T_8$ in Fig. 3(b). $V$ is shifted up by a diode
drop to set $V_g$, which is then connected to the gate terminal of $T_8$.
Thus, the expanding element of Fig. 1(a) has been successfully
replicated, and as we show below, the body effect of $T_8$ is also
compensated for.

Fig. 3. Circuit schematic of (a) the maximum circuit and (b) the sinh resistor (sinh R).

For the sinh R circuit, we can write

$$I_{DS7} = \frac{I_b}{2} = I_o\, e^{(\kappa V_g - V)/U_T} \;\Rightarrow\; I_o\, e^{\kappa V_g/U_T} = \frac{I_b}{2}\, e^{V/U_T} \qquad (4)$$

$$I_{sinh} = I_{21} = I_o\, e^{\kappa V_g/U_T}\left(e^{-V_1/U_T} - e^{-V_2/U_T}\right). \qquad (5)$$

Equation (5) can be further simplified, using (4) and then (3) with $V = V_{out}$, to

$$I_{sinh} = \frac{I_b}{2}\, e^{V/U_T}\left(e^{-V_1/U_T} - e^{-V_2/U_T}\right) = \frac{I_b}{2}\left(e^{\kappa V_1/U_T} + e^{\kappa V_2/U_T}\right)^{1/\kappa}\left(e^{-V_1/U_T} - e^{-V_2/U_T}\right). \qquad (6)$$

For analytical purposes, we decompose the two inputs to our
sinh R as $V_1 = V_{CM} - V_{diff}/2$ and $V_2 = V_{CM} + V_{diff}/2$.
Performing some algebra on (6) yields

$$I_{sinh} = I_{21} = I_h \sinh\left(\frac{V_{diff}}{2U_T}\right), \quad \text{where} \quad I_h = 2^{1/\kappa}\, I_b \cosh^{1/\kappa}\left(\frac{\kappa V_{diff}}{2U_T}\right). \qquad (7)$$

Assuming $\kappa$ is close to one, we can approximate (7) to obtain

$$I_{sinh} = I_{21} \approx 2I_b \sinh\left(\frac{V_{diff}}{2U_T}\right)\cosh\left(\frac{V_{diff}}{2U_T}\right) = I_b \sinh\left(\frac{V_{diff}}{U_T}\right). \qquad (8)$$

Viewing the entire circuit of Fig. 3(b) as a two-port element,
(8) shows that the current through this element ($I_{21}$) is
proportional to the sinh of the differential voltage applied to it
($V_{diff} = V_2 - V_1$); thus, we have created a resistor with a
sinh I-V characteristic. Note that there is no current when $V_{diff}$
is zero. The absence of $V_{CM}$ in (8) implies that as long as $V_{diff}$ is
fixed, the current has no dependence on common-mode voltage
or on the body effect of $T_8$, just like in a real resistor. An
intuitive way to understand the common-mode rejection is that,
for a fixed differential voltage of $\Delta V$ between the drain and
the source of $T_8$, its current only depends on its $\kappa V_g - V_s$ or
$\kappa V_g - V_d$. However, since $V_s$ and $V_d$ are connected to the
maximum circuit, no matter what their common-mode voltage level
is, $\kappa V_g - V_d = \kappa V_g - V$ is set by $I_b/2$, the current through
transistor $T_7$ [see (4)].
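
The common-mode rejection and the kappa-close-to-one approximation can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes (6) in the form $I_{sinh} = (I_b/2)\,e^{V/U_T}(e^{-V_1/U_T} - e^{-V_2/U_T})$ with $V$ given by (3), and the function names, bias current, and kappa values are hypothetical choices, not values from the chip.

```python
import math

U_T = 0.0259  # thermal voltage in volts (room temperature, as in the text)


def soft_max(v1, v2, kappa, u_t=U_T):
    """Output of the maximum circuit, per (3)."""
    a, b = kappa * v1 / u_t, kappa * v2 / u_t
    m = max(a, b)
    return (u_t / kappa) * (m + math.log(math.exp(a - m) + math.exp(b - m)))


def i_sinh_exact(v1, v2, i_b, kappa, u_t=U_T):
    """Current from node 2 to node 1 per (6), written with V = V_out of (3)."""
    v = soft_max(v1, v2, kappa, u_t)
    return 0.5 * i_b * (math.exp((v - v1) / u_t) - math.exp((v - v2) / u_t))


def i_sinh_approx(v_diff, i_b, u_t=U_T):
    """Approximation (8), which assumes kappa is close to one."""
    return i_b * math.sinh(v_diff / u_t)


def split(v_cm, v_diff):
    """Return (V1, V2) for a given common-mode and differential voltage."""
    return v_cm - v_diff / 2, v_cm + v_diff / 2


if __name__ == "__main__":
    i_b = 1e-9  # assumed 1-nA bias current, for illustration only
    for v_cm in (1.0, 2.5, 3.0):
        v1, v2 = split(v_cm, 0.05)
        print(v_cm, i_sinh_exact(v1, v2, i_b, kappa=0.7), i_sinh_approx(0.05, i_b))
```

In this idealized model the current is the same at every common-mode level, and for kappa below one the exact expression exceeds the approximation of (8), in line with the remark in the experimental section that (8) slightly underestimates the actual current.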

The transconductance ($g_m$) of our sinh R is given by

$$g_m = \frac{dI_{21}}{dV_{diff}} = \frac{I_b}{U_T}\cosh\left(\frac{V_{diff}}{U_T}\right) \;\Rightarrow\; g_m\Big|_{V_{diff}=0} = \frac{I_b}{U_T}. \qquad (9)$$

C. Experimental Results 


A circuit prototype of our sinh R, as illustrated in Fig. 3(b),
was fabricated in a 1.5-μm CMOS MOSIS n-well process. All
the transistors had the same size (4.8 μm/4.8 μm). The experimental
tests were all run on a 5-V power supply. The common-mode
voltage of the signals applied to the sinh R was 2.5 V, unless
otherwise stated.

The current versus differential voltage ($V_{diff}$) characteristic
of our sinh R is plotted in Fig. 4(a) for three different values
of bias voltage $V_b$ equal to 0.35 V, 0.40 V, and 0.45 V, corresponding
to 110 pA, 470 pA, and 1.75 nA of bias current $I_b$, respectively.
A magnified view of the curves in Fig. 4(a) near the
origin is shown in Fig. 4(b). The theoretical fits have been calculated
using (8) for Fig. 4(a) and (9) for Fig. 4(b). We see that
the experimental data are in good agreement with theory. However,
since $\kappa$ is less than one in practice, the sinh current formula
of (8) and also the transconductance formula of (9) slightly
underestimate the actual $I_{sinh}$ and $g_m$. At large magnitudes of
$V_{diff}$ (not shown in these figures), the theoretical current eventually
surpasses the experimental results because $T_8$ gradually
leaves the subthreshold exponential region and operates in the
above-threshold square-law regime.

Fig. 5 demonstrates the variation of sinh current with the
common-mode dc voltage ($V_{CM}$) at a fixed $V_{diff}$. We observe
that for a $V_{CM}$ range of 3 V (0.5 V to 3.5 V), the current changes
only by a factor of 1.8. To compare, we note that such a variation
in current could have been caused by a change of only about
15 mV in $V_{diff}$. In other words, the effect of common-mode
voltage on the current is about 200 times weaker than the effect
of differential voltage, translating to a common-mode rejection
ratio (CMRR) of 46 dB. The small variation of current
with $V_{CM}$ seen in Fig. 5, which is not predicted by (7) or (8),
is due to the fact that $\kappa$ slightly rises with an increase in the
common-mode voltage [1], which causes the sinh current to
drop. The nonzero drain-to-source conductances of transistors
also have an effect. The sudden drop of the current at the two ends
of Fig. 5 is due to transistors $T_4$ and $T_6$ in the maximum circuit
of Fig. 3(a) coming out of saturation, thus disrupting the proper
function of the circuit.
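
The 15-mV and 46-dB figures quoted above can be reproduced with a short calculation. The sketch below assumes that the comparison is made where $|V_{diff}| \gg U_T$, so that $\sinh(V_{diff}/U_T)$ in (8) behaves like an exponential and a current ratio $r$ corresponds to a change of $U_T \ln r$ in $V_{diff}$. That operating-point assumption and the function names are illustrative and are not stated in the text.

```python
import math

U_T = 0.0259  # thermal voltage in volts


def equivalent_vdiff_change(current_ratio, u_t=U_T):
    """Change in V_diff giving the same current ratio, assuming |V_diff| >> U_T."""
    return u_t * math.log(current_ratio)


def cmrr_db(v_cm_span, current_ratio, u_t=U_T):
    """Ratio of the common-mode span to the equivalent V_diff change, in dB."""
    return 20.0 * math.log10(v_cm_span / equivalent_vdiff_change(current_ratio, u_t))


if __name__ == "__main__":
    dv = equivalent_vdiff_change(1.8)
    print("equivalent V_diff change (V):", dv)
    print("rejection factor:", 3.0 / dv)
    print("CMRR (dB):", cmrr_db(3.0, 1.8))
```

Under this assumption the equivalent differential change is close to 15 mV and the rejection factor is close to 200, which rounds to the quoted 46 dB.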


III. EXAMPLE APPLICATION IN A DIFFERENTIAL PAIR


A. Basic Idea and Circuit Implementation 


The circuit schematic of a standard CMOS source-coupled
differential pair is shown in Fig. 6(a). Although this transconductor
uses additional current mirrors to achieve a wide output
voltage range [2], the presence of this extra circuitry does not
affect the arguments presented in this section regarding differential
pairs, assuming ideal mirrors. If the transistors are biased
in the subthreshold regime, the differential output current ($I_{out}$)
and the transconductance ($G_m$) of this circuit are shown to be
[1]-[3]

$$I_{out} = 2I_{dc}\tanh\left(\frac{\kappa V_{in}}{2U_T}\right) = 2I_{dc}\tanh\left(\frac{V_{in}}{V_L}\right)$$
$$\Rightarrow\; G_m = \left.\frac{dI_{out}}{dV_{in}}\right|_{V_{in}=0} = \frac{2I_{dc}}{V_L} = \frac{\kappa(2I_{dc})}{2U_T}\ \mathrm{A/V}. \qquad (10)$$

tanh is a nonlinear function, which can produce harmonic
distortion in the signal. Also, the linear range
($V_L = 2U_T/\kappa \approx \pm 75$ mV) of this transconductor is too
small for many applications where it is desirable to handle large
inputs without distortion.
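
For a feel of the numbers in (10), the following sketch evaluates the linear range and the tanh transfer curve. The value kappa = 0.69 is an assumption inferred from the quoted linear range of about 75 mV; it is not given explicitly in the text, and the function names are hypothetical.

```python
import math

U_T = 0.0259  # thermal voltage in volts


def linear_range(kappa, u_t=U_T):
    """V_L = 2*U_T/kappa, from (10)."""
    return 2.0 * u_t / kappa


def i_out_tanh(v_in, i_dc, kappa, u_t=U_T):
    """Differential output current of the basic pair, per (10)."""
    return 2.0 * i_dc * math.tanh(v_in / linear_range(kappa, u_t))


def g_m(i_dc, kappa, u_t=U_T):
    """Small-signal transconductance at V_in = 0, per (10)."""
    return 2.0 * i_dc / linear_range(kappa, u_t)


if __name__ == "__main__":
    kappa = 0.69  # assumed value, chosen so that V_L is about 75 mV
    i_dc = 5e-9   # 2*I_dc = 10 nA, as used in the measurements
    print("V_L (V):", linear_range(kappa))
    print("G_m (A/V):", g_m(i_dc, kappa))
    print("I_out at V_in = V_L (A):", i_out_tanh(linear_range(kappa), i_dc, kappa))
```

At an input equal to $V_L$ the output has already reached $\tanh(1) \approx 76\%$ of its limiting value $2I_{dc}$, which shows how quickly the basic pair compresses.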

One intuitive solution to improve the linearity problem of 
such a tanh differential transconductor is to compensate for 
the compressive properties of a tanh with the expansive prop- 
erties of a nonlinear function of its own hyperbolic kind, such 
as a sinh, to obtain a more linearized curve compared to a pure 
tanh. To this end, we simply source degenerate our differential 
pair with the sinh R developed in Section II. The circuit of such 
a sinh-degenerated CMOS differential pair is demonstrated in 
Fig. 6(b). 


B. Theoretical Analysis and Experimental Results 


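Using (2) for transistors $T_1$ and $T_2$ in the circuit of Fig. 6(b)
and calculating their current ratio, sum, and difference, we
derive

$$\frac{I_{DS1}}{I_{DS2}} = e^{(\kappa V_{in}/U_T) - ((V_2 - V_1)/U_T)}$$
$$\Rightarrow\; \frac{I_{out}}{2I_{dc}} = \frac{I_{DS1} - I_{DS2}}{I_{DS1} + I_{DS2}} = \tanh\left(\frac{\kappa V_{in}}{2U_T} - \frac{V_2 - V_1}{2U_T}\right)$$
$$\Rightarrow\; \tanh^{-1}\left(\frac{I_{out}}{2I_{dc}}\right) + \frac{V_2 - V_1}{2U_T} = \frac{\kappa V_{in}}{2U_T}. \qquad (11)$$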
Fig. 6. Circuit schematic of (a) basic CMOS differential pair with wide output voltage range and (b) a sinh-linearized CMOS differential pair.


Applying (8) to the sinh R element of Fig. 6(b), (11) is transformed to

$$\tanh^{-1}\left(\frac{I_{out}}{2I_{dc}}\right) + \frac{1}{2}\sinh^{-1}\left(\frac{I_{out}}{2I_b}\right) = \frac{\kappa V_{in}}{2U_T}. \qquad (12)$$

In the compact formula of (12), it is reassuring to recognize
that: first, with no $\sinh^{-1}$ term (or equivalently $I_{dc} \ll I_b$), (12)
reduces to (10), which is the characteristic of a basic differential
pair, as expected; second, since the argument of the $\tanh^{-1}$
function implies $|I_{out}| < 2I_{dc}$, $2I_{dc}$ sets the limiting current
of the sinh-linearized differential transconductor in a similar
fashion as it does for the basic differential pair of Fig. 6(a).

To study the linearity of the transfer curve of (12), let us consider
the Taylor expansions of the following functions about
zero

$$\tanh^{-1}(x) = \frac{1}{2}\ln\left(\frac{1+x}{1-x}\right) = x + \frac{x^3}{3} + \frac{x^5}{5} + \frac{x^7}{7} + \cdots$$
$$\sinh^{-1}(y) = \ln\left(y + \sqrt{1+y^2}\right) = y - \frac{y^3}{6} + \frac{3y^5}{40} - \frac{5y^7}{112} + \cdots \qquad (13)$$


To obtain a maximally linear $I_{out}$-$V_{in}$ curve, we can Taylor
expand (12) based on (13) and then adjust $I_{dc}$ and $I_b$ to eliminate
the cubic-distortion term. This is achieved if

$$\frac{1}{3} - \frac{1}{12}\left(\frac{I_{dc}}{I_b}\right)^3 = 0 \;\Rightarrow\; \frac{I_{dc}}{I_b} = 4^{1/3} \approx 1.59. \qquad (14)$$

If the optimal condition of (14) is satisfied, the first remaining
nonlinearity will be due to the fifth-order term and the Taylor
expansion of (12) reduces to

$$1.794x + 0.578x^5 - 0.424x^7 + \cdots = \frac{\kappa V_{in}}{2U_T} \qquad (15)$$





IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 40, NO. 2, FEBRUARY 2005 


with $x = I_{out}/2I_{dc}$. In comparison, the $I_{out}$-$V_{in}$ characteristic
of a simple differential transconductor [see (10)] has cubic
distortion

$$x + \frac{x^3}{3} + \frac{x^5}{5} + \cdots = \frac{\kappa V_{in}}{2U_T}. \qquad (16)$$

For example, at $x = 0.5$, the tanh differential pair has cubic
distortion of 8.3%, as compared to the sinh-linearized differential
transconductor, which has only fifth-order distortion of
about 2%. Therefore, the tanh is made more linear by sinh
degeneration.
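
The coefficients in (14) and (15) and the distortion percentages above can be reproduced from the series in (13). The sketch below is a check under the assumption that distortion is measured as the ratio of the leading nonlinear term to the linear term of the expansion; the function and constant names are illustrative.

```python
# Odd-order Taylor coefficients (x, x^3, x^5, x^7) taken from (13).
ATANH_COEFFS = [1.0, 1.0 / 3.0, 1.0 / 5.0, 1.0 / 7.0]
ASINH_COEFFS = [1.0, -1.0 / 6.0, 3.0 / 40.0, -5.0 / 112.0]


def series_coeffs(r):
    """Coefficients of x, x^3, x^5, x^7 in tanh^-1(x) + 0.5*sinh^-1(r*x).

    Here x = I_out/(2*I_dc) and r = I_dc/I_b, so r*x = I_out/(2*I_b) as in (12).
    """
    return [a + 0.5 * s * r ** (2 * k + 1)
            for k, (a, s) in enumerate(zip(ATANH_COEFFS, ASINH_COEFFS))]


R_OPT = 4.0 ** (1.0 / 3.0)  # optimal I_dc/I_b from (14)
C1, C3, C5, C7 = series_coeffs(R_OPT)


def cubic_distortion_basic(x):
    """Ratio of the cubic term to the linear term for the plain tanh pair."""
    return x ** 2 / 3.0


def fifth_order_distortion_linearized(x):
    """Ratio of the fifth-order term to the linear term at the optimum."""
    return C5 * x ** 4 / C1


if __name__ == "__main__":
    print("I_dc/I_b:", R_OPT)
    print("coefficients:", C1, C3, C5, C7)
    print("basic pair, x = 0.5:", cubic_distortion_basic(0.5))
    print("linearized, x = 0.5:", fifth_order_distortion_linearized(0.5))
```

With the ratio set to $4^{1/3}$ the cubic coefficient vanishes to within rounding error, and the remaining coefficients match the 1.794, 0.578, and −0.424 of (15).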

From (12), the transconductance ($G_m$) of our new differential
pair is found to be

$$G_m = \left.\left(\frac{dV_{in}}{dI_{out}}\right)^{-1}\right|_{I_{out}=0} = \left[\frac{2U_T}{\kappa}\left(\frac{1}{2I_{dc}} + \frac{1}{4I_b}\right)\right]^{-1} = \frac{\kappa(2I_{dc})}{2U_T}\cdot\frac{1}{1 + I_{dc}/2I_b}\ \mathrm{A/V}. \qquad (17)$$

Compared to (10), (17) suggests that $G_m$ is decreased and
thus $V_L$ is increased (remember that the maximum current
remains the same at $2I_{dc}$) by a factor equal to $1 + I_{dc}/2I_b$. For
the optimal case of (14), this factor is almost 1.8 [seen also in
(15)], which makes the new $V_L$ equal to ±135 mV. This 80%
increase in linear range costs a 16% (i.e., $I_b/4I_{dc}$) increase in
power consumption.
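
As a worked check of the numbers in this paragraph, the steps below differentiate (12) at the origin and then insert the optimal ratio of (14). This is a sketch of the intermediate algebra, assuming (12) has the form $\tanh^{-1}(I_{out}/2I_{dc}) + \tfrac{1}{2}\sinh^{-1}(I_{out}/2I_b) = \kappa V_{in}/2U_T$ and taking the basic-pair linear range as 75 mV.

```latex
\begin{align*}
\frac{\kappa}{2U_T}\,\frac{dV_{in}}{dI_{out}}
  &= \frac{1}{2I_{dc}}\cdot\frac{1}{1-\left(I_{out}/2I_{dc}\right)^{2}}
   + \frac{1}{2}\cdot\frac{1}{2I_b}\cdot\frac{1}{\sqrt{1+\left(I_{out}/2I_b\right)^{2}}} \\
\left.\frac{dV_{in}}{dI_{out}}\right|_{I_{out}=0}
  &= \frac{2U_T}{\kappa}\left(\frac{1}{2I_{dc}}+\frac{1}{4I_b}\right)
   = \frac{2U_T}{\kappa\,(2I_{dc})}\left(1+\frac{I_{dc}}{2I_b}\right) \\
1+\frac{I_{dc}}{2I_b}\,\bigg|_{I_{dc}/I_b = 4^{1/3}}
  &= 1+\frac{1.587}{2}\approx 1.794,
  \qquad V_L \approx 1.794\times 75\ \text{mV}\approx 135\ \text{mV} \\
\frac{I_b}{4I_{dc}}\,\bigg|_{I_{dc}/I_b = 4^{1/3}}
  &= \frac{1}{4\times 1.587}\approx 0.157\approx 16\%
\end{align*}
```

The factor 1.794 is the same number that appears as the linear coefficient in (15), which is why the two routes to the new linear range agree.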

CMOS differential pairs without and with sinh-linearization
were fabricated in a 1.5-μm CMOS chip with the same size
for all the transistors (4.8 μm/4.8 μm). Fig. 7 shows a photograph
of the chip. The experimental output current versus input
voltage dc characteristics for these two circuits are plotted in
Fig. 8(a). The data were taken with $2I_{dc} = 10$ nA. The optimal
condition of (14) was also satisfied in the second circuit. We
clearly see that the curve with sinh linearization has a smaller
slope (transconductance) and a larger linear range than the one
without. The observed improvement factor is 1.7, close to the 1.8
that theory predicts.

To further study the linearity of the transfer curves, which is
difficult to examine visually from the graphs in Fig. 8(a), we
performed the following analysis: Having fixed $2I_{dc}$ at 10 nA, we
varied $I_b$ over a large range. For each setting, we measured the
voltage (normalized by $2U_T/\kappa$) versus current (normalized by
$2I_{dc}$) transfer characteristic of the sinh-linearized differential
pair. We fit a fifth-order polynomial to each experimental curve.
In other words, we experimentally derived the polynomial
approximation to the main formula of (12) for different values of $I_b$. In
Fig. 8(b), we plot the magnitudes of the coefficient of the first
(linear) term and the coefficient of the third (cubic) term as $I_{dc}/I_b$
is changed. We see that the minimum magnitude for the cubic
term occurs at $I_{dc}/I_b = 1.72$, close to the theoretical value of 1.59
predicted in (14).

We also configured the transconductors of Fig. 6 as two
simple first-order low-pass $G_m$-C filters (i.e., output terminal
connected to the negative input terminal and a capacitor)
with cutoff frequencies (i.e., $G_m/2\pi C$) of 4 kHz. The measured
frequency response of the filter with sinh degeneration
is illustrated in Fig. 9. As another experimental test of our
sinh-linearizing idea, we applied a 280-mV$_{pp}$ (100-mV$_{rms}$)
passband sinusoidal input signal at 110 Hz to these filters. We
measured the spectrum of their output signals with an SR785
Spectrum Analyzer. We observed that the rms amplitude of the
third harmonic in the sinh-linearized filter output is smaller
by a factor of 27 (28.6 dB) than the same term in the standard
tanh filter output. In fact, in our filters, the second harmonic
was the main contributor to nonlinearity and the total harmonic
distortion was measured at about 1%. The presence of the
second harmonic is attributed to device mismatches (among
our relatively small transistors), variations in $\kappa$ with voltage,
and the existence of parasitic capacitors on the common source
nodes [14], which all distort the symmetry of the I-V transfer
curve. As is well known, employing a fully differential $G_m$-C
filter topology significantly reduces even-order nonlinearities
[15]. In such a circuit, the effect of our sinh-linearizing scheme
on improving nonlinearity would be substantial.

Fig. 7. Microphotograph of the chip containing the circuits of the sinh R and
the basic and sinh-linearized differential pairs.
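
The unit conversions used in this paragraph are easy to verify. The helper functions below are a minimal sketch with hypothetical names; the transconductance passed to the capacitor helper in the example is an arbitrary illustrative value, since the filter bias is not stated in the text.

```python
import math


def ratio_to_db(amplitude_ratio):
    """Express an amplitude ratio in decibels."""
    return 20.0 * math.log10(amplitude_ratio)


def vpp_to_vrms(v_pp):
    """RMS value of a sine wave with peak-to-peak amplitude v_pp."""
    return v_pp / (2.0 * math.sqrt(2.0))


def gm_c_cutoff_hz(g_m, c):
    """Cutoff frequency of a first-order G_m-C low-pass filter: G_m/(2*pi*C)."""
    return g_m / (2.0 * math.pi * c)


def capacitor_for_cutoff(g_m, f_c):
    """Capacitance that places the G_m-C cutoff at f_c."""
    return g_m / (2.0 * math.pi * f_c)


if __name__ == "__main__":
    print("factor 27 in dB:", ratio_to_db(27))
    print("280 mV pp in V rms:", vpp_to_vrms(0.280))
    # Assumed transconductance of 100 nA/V, for illustration only.
    print("C for 4 kHz (F):", capacitor_for_cutoff(1e-7, 4e3))
```

The 280-mV peak-to-peak sine corresponds to about 99 mV rms, which the text rounds to 100 mV, and a factor of 27 in amplitude corresponds to the quoted 28.6 dB.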

We also briefly discuss the noise of our sinh R and sinh-linearized
transconductor. Noise is important because it determines
the lower bound on the dynamic range. For low subthreshold
current levels, the 1/f noise of transistors is usually negligible
compared to thermal noise [1]. In the circuit of Fig. 3(b), the
current noise of the sinh R is generated only by the shot noise
of the main sinh transistor $T_8$ if $V_{DS}$ is zero. For nonzero $V_{DS}$,
the noise of the maximum circuit multiplicatively modulates the
current flowing through $T_8$ and is thus $V_{DS}$ dependent. When
both inputs are (small-signal) grounded, the current noise power
spectral density of $T_8$ is $4qI_{DSsat}$ [13], where $I_{DSsat}$ is the
saturation current of the transistor, given by (2). Therefore

$$\overline{i^2_{n,sinh}} = 4qI_{DSsat} = 4qI_o\, e^{(\kappa V_g - V_1)/U_T}\Big|_{V_1 = V_2} = 4qI_o\, e^{(\kappa V_g - (V - \ln 2 \times U_T/\kappa))/U_T}$$
$$= 2qI_b\, 2^{1/\kappa} = 2^{(1 + 1/\kappa)}\, qI_b \approx 4qI_b\ \mathrm{A^2/Hz} \quad (\kappa \approx 1). \qquad (18)$$

Thus, in this case, the input-referred voltage noise of the
sinh-linearized transconductor of Fig. 6(b) is found by a standard
procedure [1], [16] to be
Fig. 8. (a) Experimental output current versus input voltage characteristics of the two circuits displayed in Fig. 6. (b) Magnitudes of polynomial coefficients that are fit to measured I-V curves of the transconductor of Fig. 6(b) for different values of $I_{dc}/I_b$.
where $G_m$ is given by (17). With $2I_{dc} = 10$ nA and the $I_b/I_{dc}$ ratio
set according to the optimal condition of (14), the input-referred
voltage noise was theoretically calculated to be 1.9 μV/√Hz,
and was measured at about 1.7 μV/√Hz.

Fig. 9. Experimental frequency response of the sinh-linearized transconductor of Fig. 6(b) configured as a simple first-order low-pass $G_m$-C filter.

Resistive [4] and diode [3] degeneration are among the
most widely used linearization techniques in circuit design. The
main shortcoming of resistive degeneration, besides the impracticality
of creating the large passive resistors required in circuits
operating in subthreshold, is that a resistor, as a linear element,
has limited ability to oppose and improve the distortion introduced
by inherently nonlinear exponential elements like transistors.
A diode-degenerated differential pair also suffers from
the same deficiency and essentially produces the same level of
distortion as a simple differential pair does. The sinh R, on the
other hand, can exploit its own nonlinearity to
counteract and cancel unwanted nonlinearities. In Table I, we
compare some of the characteristics of a basic, a resistive-degenerated,
and a diode-degenerated tanh differential pair with
those of our sinh-degenerated transconductor. We see that an
important advantage of our scheme is that it can be utilized to
TABLE I
CHARACTERISTICS OF BASIC AND VARIOUS DEGENERATED tanh DIFFERENTIAL PAIRS

Columns: Basic Differential Pair; Resistive-degenerated Differential Pair; Diode-degenerated Differential Pair; Sinh-degenerated Differential Pair.

Rows (transconductor characteristics): Linear Range ($V_L$), normalized (surviving entries: 1.8, 2.4, 1.8); Cubic (3rd) Harmonic Distortion; Total Harmonic Distortion (THD) at $I_{out}/2I_{dc} = 0.5$; Power, normalized; Area/Number of Transistors (surviving entry: Very Large!); Effective Number of Noise-contributing Transistors (N); Input dc Voltage Range.

Notes: $2I_{dc} = 10$ nA; $I_b = 3.1$ nA (to satisfy the optimal condition of (14)); R = 8.2 MΩ (to set the same linear range for both resistive-degenerated and sinh-degenerated differential pairs).


eliminate cubic distortion, a useful feature that can never be 
achieved by resistive or diode degeneration or even most of the 
other linearization schemes introduced in Section I. This quality 
results in a lower total harmonic distortion (THD) and a more 
linear I-V curve for the transconductor. However, we should
note that, like every other engineering approach, our technique 
that has been optimized for minimal harmonic distortion does 
not necessarily exhibit the best performance in all the other rel- 
evant properties, as we observe in Table I. Our scheme can, thus, 
be used in the design of transconductor circuits in which min- 
imal distortion is of paramount interest. 


IV. CONCLUSION


We described the basic idea and a compact CMOS implementation
of a tunable resistor that possesses a sinh I-V characteristic.
We showed that the current of such a resistor depends
only on the sinh of its input differential voltage, not on its
common-mode value, just like a normal resistor. We presented
and justified experimental results that were in good agreement
with our theoretical predictions. As an example application, we
utilized our sinh R to degenerate a compressive subthreshold
tanh differential pair and adjusted the circuit to cancel the cubic
distortion introduced by a pure tanh curve, which effectively
widens the linear range by 80%. We also confirmed the effectiveness
of our linearization technique in a first-order $G_m$-C
filter where we reduced the third harmonic distortion by a factor
of 27. The achieved extra linearity and its consequent drop in
distortion are desirable qualities in many applications for differential
transconductors, such as filters, mixers, and amplifiers.
An Ultra-Wideband CMOS Low Noise Amplifier for 3–5-GHz UWB System


Chang-Wan Kim, Min-Suk Kang, Phan Tuan Anh, Hoon-Tae Kim, and Sang-Gug Lee 


Abstract—An ultra-wideband (UWB) CMOS low noise amplifier 
(LNA) topology that combines a narrowband LNA with a resistive 
shunt-feedback is proposed. The resistive shunt-feedback provides 
wideband input matching with small noise figure (NF) degradation 
by reducing the Q-factor of the narrowband LNA input and flattens 
the passband gain. The proposed UWB amplifier is implemented 
in 0.18-μm CMOS technology for a 3.1–5-GHz UWB system. Measurements
show a −3-dB gain bandwidth of 2–4.6 GHz, a minimum
NF of 2.3 dB, a power gain of 9.8 dB, better than −9 dB
of input matching, and an input IP3 of −7 dBm, while consuming
only 12.6 mW of power.


Index Terms—Broadband, CMOS, feedback, low noise ampli- 
fier, RF, ultra-wideband. 


I. INTRODUCTION 


RECENTLY, the interest in ultra-wideband (UWB) systems
for wireless personal area network (WPAN) applications
has increased significantly, though the international standard has
yet to be finalized. The allocated frequency band of the UWB
system is 3.1–10.6 GHz (low-frequency band: 3.1–5 GHz;
high-frequency band: 6–10.6 GHz). Two recent major proposals [1],
[2] for the IEEE 802.15.3a propose that data rates of up to
400-480 Mb/s can be obtained using only the low-frequency
band. The low-frequency band has been allocated for the development
of the first-generation UWB system. CMOS technology
is a satisfactory choice for the implementation of the low-band
UWB system when considering the time to market, hardware
cost, the degree of difficulty, etc.

Until now, reported CMOS-based wideband amplifiers tend 
to be dominated by two different topologies: the distributed 
and resistive shunt-feedback amplifiers. The distributed ampli- 
fiers [3], [4] normally provide wide bandwidth characteristics 
but tend to consume large dc current due to the distribution of 
multiple amplifying stages, which makes them unsuitable for 
low-power application. The resistive shunt-feedback-based am- 
plifiers [5], [6] provide good wideband matching and flat gain, 
but tend to suffer from poor noise figure (NF) and large power 
dissipation. In the resistive shunt-feedback amplifier, input re- 
sistance is determined by the feedback resistance divided by 
the loop-gain of the feedback amplifier [7]. Therefore, the feed- 
back resistor tends to be a few hundred ohms in order to match 
the low signal source resistance of typically 50 Ω, leading to
significant NF degradation. Furthermore, even with a moderate 
amount of voltage gain, the amplifier requires a rather large 
amount of current, especially in the CMOS, due to its strong 
Fig. 1. Narrowband LNA topology. (a) Overall schematic. (b) Small-signal
equivalent circuit at the input. 


dependence for voltage gain on the transconductance of the am- 
plifying transistor. Recently, a new topology of a wideband am- 
plifier for UWB system, which adopts a bandpass LC filter at 
the input of the cascode low noise amplifier (LNA) for wideband 
input matching, has been reported in [8] and [9]. The bandpass 
filter-based topology incorporates the input impedance of the 
cascode amplifier as a part of the filter, and shows good perfor- 
mances while dissipating small amounts of dc power. However, 
the adoption of the LC filter at the input mandates a number of 
reactive elements, which could lead to a larger chip area and NF 
degradation in the case of on-chip implementation, or the addi- 
tional external components. 

This paper proposes a new low power, low noise, and wide- 
band amplifier combining a narrowband LNA with the con- 
ventional resistive shunt-feedback. The design principles and 
the measurement results of the implemented 3.1–5-GHz UWB
LNA are described. 


II. DESIGN OF WIDEBAND AMPLIFIER 


Fig. 1(a) shows a typical narrowband cascode LNA topology.
In Fig. 1(a), the inductor $L_s$ is added for simultaneous noise and
input matching and $L_g$ for the impedance matching between the
source resistance $R_s$ and the input of the LNA [10]. Fig. 1(b)
shows the small-signal equivalent circuit for the input part of the
overall LNA, where $C_{gs}$ represents the gate-source capacitance
of the input transistor $M_1$. In Fig. 1(b), a series combination
Fig. 2. UWB LNA topology. (a) Overall schematic. (b) Small-signal
equivalent circuit at the input. 


of reactive elements is chosen to resonate at the frequencies of 
interest such that Z;,, becomes a real value with wrL, being 
equal to R,. The wr represents the cutoff frequency of transistor 
My,. The quality factor Q of the series resonating input circuit 
shown in Fig. 1(b) can be given by [11] 


$$Q_{NB} = \frac{1}{(R_s + \omega_T L_s)\,\omega_0 C_{gs}} \tag{1}$$


where $\omega_0$ represents the resonant frequency. With a typical
LNA, the Q-factor shown in (1) is generally preferred to be high
for high-gain and low-noise performance while dissipating low
dc power. Since the −3-dB bandwidth of a typical RLC series
resonant circuit is inversely proportional to its Q-factor
($BW_{-3\,\mathrm{dB}} = \omega_0/Q_{NB}$), the LNA shown in
Fig. 1(a) is unsuitable for wideband application.
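
As a rough numerical illustration of (1) and of the bandwidth relation above, the following sketch evaluates the Q-factor and the resulting −3-dB bandwidth of the series-resonant input. The component values (a 4-GHz center frequency, a 50-Ω source, and in particular a gate-source capacitance of 0.15 pF) are assumptions chosen only for illustration and are not taken from the design described here; the function names are likewise hypothetical.

```python
import math

def q_narrowband(r_s, wt_ls, f0_hz, c_gs):
    """Q of the series-resonant input: 1 / ((R_s + w_T*L_s) * w_0 * C_gs)."""
    w0 = 2 * math.pi * f0_hz
    return 1.0 / ((r_s + wt_ls) * w0 * c_gs)

def bandwidth_3db(f0_hz, q):
    """-3-dB bandwidth of a series RLC circuit: f_0 / Q."""
    return f0_hz / q

# Assumed example values (not from the paper).
R_S = 50.0        # source resistance in ohms
F0 = 4e9          # resonant frequency in Hz
C_GS = 0.15e-12   # gate-source capacitance in farads (illustrative)

# Matched input: w_T * L_s is set equal to R_s.
q_nb = q_narrowband(R_S, R_S, F0, C_GS)
bw_nb = bandwidth_3db(F0, q_nb)
print(f"Q_NB = {q_nb:.2f}, BW = {bw_nb / 1e9:.2f} GHz")
```

With these assumed values the bandwidth comes out near 1.5 GHz, which illustrates why a high-Q input network cannot by itself cover a band several gigahertz wide.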

Fig. 2(a) shows the proposed wideband LNA topology. In
Fig. 2(a), $R_f$ is added as a shunt-feedback element to the
conventional cascode narrowband LNA and $L_{load}$ is used as a
shunt peaking inductor at the output [12]. The capacitor $C_f$ is
used for ac coupling. The source follower, composed of $M_3$ and
$M_4$, is added for measurement purposes only, and provides
wideband output matching. $C_1$ and $C_2$ are ac coupling
capacitors.

Fig. 3. Simulated $S_{11}$ traces of the LNA with the feedback
resistor $R_f$ (= 1.5 kΩ) and without it, for frequencies over
3-5 GHz.

Fig. 2(b) shows the small-signal equivalent circuit for the
input part of the proposed wideband LNA. In Fig. 2(b), the
resistor $R_{fM}$ [$= R_f/(1 - A_v)$] represents the Miller
equivalent input resistance of $R_f$, where $A_v$ is the open-loop
voltage gain of the LNA. From Fig. 2(a) and (b), the value of
$R_f$ can be much larger than that of the conventional resistive
shunt-feedback. In the conventional resistive shunt-feedback, the
size of $R_f$ is limited because $R_f$ determines the input
impedance. However, in the proposed topology, the input impedance
is determined by $\omega_T L_s$. Therefore, in Fig. 2(a), one of
the key roles of the feedback resistor $R_f$ is to reduce the
Q-factor of the resonating narrowband LNA input circuit. The
Q-factor of the circuit shown in Fig. 2(b) can be approximately
given by


$$Q_{WB} \approx \frac{1}{\left[R_s + \omega_T L_s + \dfrac{(\omega_0 L_g)^2}{R_{fM}}\right]\omega_0 C_{gs}} \tag{2}$$


From (2), and considering the inverse relation between the
−3-dB bandwidth and the Q-factor, the narrowband LNA in
Fig. 2(a) can be converted into a wideband amplifier by the
proper selection of $R_f$.
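
The effect of the feedback resistor on the input Q-factor can be sketched numerically. The example below computes the Miller-equivalent resistance $R_f/(1 - A_v)$ and then a lowered Q-factor, under the assumption that the shunt Miller resistance can be converted into a series loss of $(\omega_0 L_g)^2/R_{fM}$. That conversion, the open-loop gain of −3, the 0.15-pF gate-source capacitance, and the 2.5-nH gate inductance are all illustrative assumptions, and the function names are hypothetical.

```python
import math

def miller_resistance(r_f, a_v):
    """Miller-equivalent input resistance R_f / (1 - A_v); A_v < 0 when inverting."""
    return r_f / (1.0 - a_v)

def q_wideband(r_s, wt_ls, l_g, r_fm, f0_hz, c_gs):
    """Input Q with the shunt R_fM modeled as a series loss (w_0*L_g)^2 / R_fM."""
    w0 = 2 * math.pi * f0_hz
    r_series = (w0 * l_g) ** 2 / r_fm  # assumed parallel-to-series conversion
    return 1.0 / ((r_s + wt_ls + r_series) * w0 * c_gs)

# Assumed example values (not from the paper).
R_S, F0 = 50.0, 4e9
C_GS = 0.15e-12   # illustrative gate-source capacitance
L_G = 2.5e-9      # illustrative gate inductance
A_V = -3.0        # illustrative open-loop voltage gain

results = {}
for r_f in (1500.0, 1000.0, 500.0):
    r_fm = miller_resistance(r_f, A_V)
    results[r_f] = q_wideband(R_S, R_S, L_G, r_fm, F0, C_GS)
    print(f"R_f = {r_f:6.0f} ohm -> R_fM = {r_fm:5.1f} ohm, Q = {results[r_f]:.2f}")
```

Under these assumptions a smaller feedback resistor gives a lower Q-factor and hence a wider input bandwidth, at the cost of a larger noise contribution from the resistor.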

For example, to design a wideband amplifier that covers a
certain frequency band, the narrowband amplifier is first
optimized at the center frequency. Then, the −3-dB bandwidth of
the small-signal equivalent input circuit is set by the proper
selection of $R_f$. Depending on the required bandwidth, the
value of $R_f$ varies, and so does the amount of noise
contributed by $R_f$. Fig. 3 shows the simulated $S_{11}$ of the
designed UWB amplifier with $R_f$ (= 1.5 kΩ) and compares it with
that of the amplifier without the feedback resistor $R_f$. As can
be seen in Fig. 3, compared to the narrowband case, the addition
of $R_f$ gathers the passband values of $S_{11}$ closer to the
center of the Smith chart, leading to wideband input matching.
The feedback resistor $R_f$ also retains its conventional role of
flattening the gain over a wider range of frequencies, with much
smaller noise figure degradation.


III. AMPLIFIER DESIGN AND MEASUREMENT RESULTS 


The proposed topology shown in Fig. 2(a) is applied to
a 3.1-5-GHz wideband amplifier based on 0.18-µm CMOS
technology. The narrowband LNA is optimized at 4 GHz by the
proper selection of the values for $L_s$ and $L_g$. With the
feedback resistor $R_f$, the bandwidth extends to cover 3-5 GHz.
In Fig. 2(a), the input transistor $M_1$ (W/L = 320/0.18 µm) is
biased at 7 mA. The size of the cascode transistor $M_2$
(240/0.18 µm) is decided considering a trade-off between gain
($S_{21}$) and −3-dB bandwidth. The value of the on-chip spiral
inductor $L_{load}$ is 2.4 nH, and its quality factor (Q) is about
9.5 at 5 GHz. The source follower, which consists of $M_3$
(80/0.18 µm) and $M_4$ (40/0.35 µm), consumes 2 mA. Although
$R_f$ = 1.5 kΩ is optimal in simulation with respect to noise
performance, the value of $R_f$ is set to 1 kΩ in order to
guarantee wideband input matching. In Fig. 2(a), the inductors
$L_s$ and $L_g$ are implemented as external components with
values of 0.6 nH and 2.5 nH, respectively. These inductors can
be absorbed as a part of the package parasitics, but in this work
they are implemented with bond wires due to the chip-on-board
(COB) evaluation of the fabricated chip. Other component
values are $C_1 = C_f$ = 2 pF, $C_2$ = 4 pF, and $R_{load}$ = 50 Ω.
For the evaluation, from Fig. 2(a), the dc biasing nodes $V_{G1}$,
$V_{G2}$, and $V_{DD1} = V_{DD2}$ are biased separately through
external voltage sources. Fig. 4 shows the measured S-parameters
of the designed UWB amplifier. As can be seen in Fig. 4, the
measured input return loss ($S_{11}$) is higher than 9.0 dB over
the 3-5-GHz range. The output return loss ($S_{22}$) is higher
than 11 dB for the same frequency range due to the source
follower output stage. The maximum power gain ($S_{21}$) is
+9.8 dB and the −3-dB bandwidth covers 2-4.6 GHz. In Fig. 4,
the amplifier shows an early power gain roll-off near 4.6 GHz
compared to the simulated value of 5 GHz. This is caused by the
increase in the peaking inductance due to the addition of
external bonding wires to the supply voltage, which had not been
accounted for properly in the simulation. As can be seen from
Fig. 4, the reverse isolation ($S_{12}$) approaches the 20-dB
range due to the feedback network. Considering the reverse
isolation provided by the source follower stage, the amount of
reverse isolation is worse than expected. Fig. 5 shows both the
measured and simulated NF of the implemented amplifier. The
measured NF shows a minimum value of 2.3 dB at 3 GHz and stays
at less than 3 dB up to 4 GHz, but rises up to 5.2 dB at 5 GHz.
Compared to the simulation, the steep increase in NF near 5 GHz
is caused by the lower power gain at these frequencies. The
discrepancy in NF between the simulation and measurements in the
2-4-GHz range is the result of inaccuracies in the transistor
noise model. From the simulation, the feedback resistor $R_f$
degrades the amplifier NF by approximately 0.6 dB. The
input-referred IP3 is measured as −7 dBm for two-tone signals at
4 GHz and 4.5 GHz. Fig. 6 shows the microphotograph of the
fabricated CMOS UWB LNA with a chip size of 0.9 mm². Table I
summarizes the measurement results and compares them with
previously reported works. In Table I, the indicated amount of
power dissipation for this work represents the power dissipated
in the cascode topology only.

Fig. 4. Measured power gain, input/output return loss, and reverse isolation of
the UWB LNA.

Fig. 5. Measured and simulated NF of the UWB LNA.

Fig. 6. Microphotograph of the fabricated UWB CMOS LNA. The inductors
$L_s$ and $L_g$ are implemented as external components.


IV. CONCLUSION 


TABLE I
COMPARISON OF WIDEBAND CMOS LNA PERFORMANCES: PUBLISHED AND THE PRESENT WORKS

Ref.      | BW (GHz) | S11 (dB) | Gain (dB) | NF* (dB) | IIP3 (dBm) | Power (mW) | Topology                   | Technology   | Year
[3]       | ...      | ...      | ...       | ...      | ...        | 83.4       | Distributed (single-ended) | 0.6 µm CMOS  | 2000
[4]       | 0.6~22   | < −8     | 8.1       | 4.3      | ...        | 52         | Distributed (single-ended) | 0.18 µm CMOS | 2003
[5]       | 0.02~16  | < −8     | ...       | ...      | ...        | 35         | Feedback                   | 0.25 µm CMOS | 2002
[6]       | ...      | < −7.2   | 13.1      | 3.3      | ...        | ...        | ...                        | 0.18 µm CMOS | 2003
[8]       | 2.4~9.5  | < −9.9   | 9.3       | 4        | −6.7       | 9**        | ...                        | 0.18 µm CMOS | 2004
[9]       | 2~10     | < −10    | 21        | 2.5      | −5.5       | ...        | ... (single-ended)         | SiGe         | 2004
This work | 2~4.6    | < −9     | 9.8       | 2.3      | −7         | 12.6**     | Proposed (single-ended)    | 0.18 µm CMOS | 2004

* Minimum NF in pass band
** Only core LNA

A new CMOS UWB LNA, applied to the lower band
(3.1-5 GHz) UWB system, is presented. The proposed amplifier
topology adopts the conventional resistive shunt-feedback
onto a narrowband LNA topology. In the proposed topology,
the wideband characteristics are obtained by utilizing the
feedback resistor as a component to reduce the Q-factor of
the narrowband amplifier input impedance. The feedback
resistor helps to extend the bandwidth of the amplifier as well
as the gain flatness, while contributing only a small amount of
NF degradation. The adoption of the narrowband amplifier allows
a lower amount of dc power dissipation. The proposed topology
is applied to a 3.1-5-GHz UWB amplifier implementation
based on 0.18-µm CMOS technology. The measured results
show more than 9 dB of input return loss, more than 11 dB of
output return loss, and a peak gain of 9.8 dB over the −3-dB
bandwidth of 2-4.6 GHz, while drawing 7 mA from a 1.8-V
supply. The minimum NF is 2.3 dB at 3 GHz and stays at less
than 3 dB up to 4 GHz, but rises up to 5.2 dB at 5 GHz. The
proposed LNA shows advantages in overall performance (NF,
power gain, power dissipation, chip size, number of external
components, etc.), compared to the distributed, conventional
shunt-feedback, or filter-based amplifiers that make up other
wideband topologies.


CMOS Wideband Amplifiers Using Multiple
Inductive-Series Peaking Technique


Chia-Hsin Wu, Student Member, IEEE, Chih-Hun Lee, Wei-Sheng Chen, and Shen-Iuan Liu, Senior Member, IEEE 


Abstract—This paper presents the technique of multiple in- 
ductive-series peaking to mitigate the deteriorated parasitic 
capacitance in CMOS technology. Employing multiple induc- 
tive-series peaking technique, a 10-Gb/s optical transimpedance 
amplifier (TIA) has been implemented in a 0.18-µm CMOS
process. The 10-Gb/s optical CMOS TIA, which accommodates
a PD capacitance of 250 fF, achieves a gain of 61 dBΩ and a 3-dB
frequency of 7.2 GHz. The noise measurement shows an average
noise current of 8.2 pA/√Hz with a power consumption of 70 mW.


Index Terms—Inductive-series peaking, transimpedance ampli- 
fier, wideband amplifier. 


I. INTRODUCTION 


WITH the rapid proliferation of numerous multimedia
networking applications, wideband high-speed
telecommunication systems, such as 10-Gb/s optical fiber-link
applications, are required. The high-speed front-end circuits
[1], [2] in these systems must combine high-frequency operation
with low cost and low power dissipation. However, CMOS devices
pose difficult design challenges, such as severe parasitic
capacitance, lower transconductance, and poorer noise
performance, which mandate circuit innovations to tackle these
issues.

The purpose of this paper is to introduce the multiple
inductive-series peaking technique to overcome the limitations of
CMOS technology. This technique can significantly extend
circuit bandwidth without a power consumption penalty.
Meanwhile, it can provide a relatively flat frequency response
similar to that of LC-ladder filters. A 10-Gb/s optical
transimpedance amplifier (TIA) has been implemented in 0.18-µm
CMOS technology to demonstrate the bandwidth-extension technique.

The design of a TIA should meet stringent constraints, such
as gain, bandwidth, noise, and dynamic range. With a typical
received power of −15 dBm and a photodiode responsivity
of about 0.75 A/W, the TIA must provide more than 1 kΩ (60 dBΩ)
of transimpedance gain to amplify the weak input current to a
detectable signal level for the succeeding stage, such as a
limiting amplifier [3]. Besides, dynamic range has been a
critical issue, especially for optical fiber link applications.
For low-speed optical interconnects, the inverter-configuration
TIA has been widely adopted [4]. Nevertheless, for high-speed
optical fiber link applications, such as those above 2.5 Gb/s,
the inverter-configuration TIA is seldom used due to its limited
speed. In this paper, an inverter-configuration TIA employing
the multiple inductive-series peaking technique is operated at
up to 10 Gb/s in CMOS technology, while retaining low-power and
area-efficient features.
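
The gain requirement quoted above can be checked with a short calculation. The sketch below converts the received power to a photocurrent and then to an output voltage for a given transimpedance. It treats the full −15 dBm as the power that produces signal current, which is a simplification (extinction ratio and coupling losses are ignored), and the helper names are hypothetical.

```python
import math

def dbm_to_watts(p_dbm):
    """Convert a power level in dBm to watts."""
    return 1e-3 * 10 ** (p_dbm / 10)

def ohms_to_dbohm(r_ohm):
    """Express a transimpedance in dB-ohm (20*log10 of the value in ohms)."""
    return 20 * math.log10(r_ohm)

def dbohm_to_ohms(g_dbohm):
    """Convert a transimpedance in dB-ohm back to ohms."""
    return 10 ** (g_dbohm / 20)

P_RX_DBM = -15.0      # received optical power from the text
RESPONSIVITY = 0.75   # photodiode responsivity in A/W from the text

i_pd = dbm_to_watts(P_RX_DBM) * RESPONSIVITY   # photocurrent in amperes
v_out = i_pd * 1e3                             # output voltage for 1 kilo-ohm

print(f"photocurrent = {i_pd * 1e6:.1f} uA")
print(f"output at 1 kohm = {v_out * 1e3:.1f} mV")
print(f"1 kohm = {ohms_to_dbohm(1e3):.0f} dB-ohm")
```

A photocurrent of a few tens of microamperes therefore yields only a few tens of millivolts even with 1 kΩ of transimpedance, which is why a limiting amplifier normally follows the TIA.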

The paper is organized as follows. Section II introduces the 
proposed multiple inductive-series peaking technique. The cir- 
cuit designs and schematics are also described in this section. 
Section III presents experimental results of the TIA. Finally, 
conclusions are given in Section IV. 


II. MULTIPLE INDUCTIVE-SERIES PEAKING TECHNIQUE 


The proposed wideband amplifier architecture is shown in
Fig. 1(a), where on-chip inductors are placed between
gain stages. Without inductors, the amplifier bandwidth
is mainly determined by the RC time constant at each node.
In CMOS technology, severe parasitic capacitance degrades
bandwidth significantly. In the proposed architecture, the
inductors placed between gain stages and the parasitic
capacitances form a third-order LC-ladder filter that acts as
an impedance transformation network [5], [6].

Considering the inter-stage small-signal model without an
inductor in Fig. 1(b), the transfer function can be expressed as

$$\frac{V_{out}}{V_{in}} = \frac{G_m R_T}{1 + s C_T R_T} \tag{1}$$

where $R_T$ denotes $R_{T1}//R_{T2}$, and $C_T$ represents
$C_1 + C_2$. $R_{T1}$/$R_{T2}$ and $C_1$/$C_2$ denote the
equivalent resistors and capacitors contributed by the previous
and next stages, respectively. The transfer function of
Fig. 1(b) with the inductor can be derived as shown in (2).
Fig. 2 shows the simulated frequency responses of the first- and
third-order filters with different inductances from 0.47 to
1.6$L_T$, where $L_T$ denotes the optimal inductance value,
$C_1 = C_2$, and $R_{T1} = R_{T2}$. The simulation results show
that a smaller inductance improves bandwidth further but also
introduces larger peaking, which deteriorates the step response.
With a proper inductance value $L_T$ and an acceptable overshoot,
the 3-dB bandwidth of the proposed topology is 2.5 times that
obtained without inserting inductors. The bandwidth-extension
effect of the proposed technique is more apparent when more
stages are cascaded. Fig. 3 shows the simulated 3-dB bandwidths
of wideband amplifiers with different numbers of cascaded stages,
where the 3-dB frequencies have been normalized with respect to
the 3-dB frequency of the first-order RC filter. It is shown
that the 3-dB bandwidth of the proposed amplifier is 6 times that
of a conventional amplifier, which is quite a large factor. The
bandwidth of conventional wideband amplifiers is significantly
degraded when more stages are cascaded. However, that of the
proposed wideband amplifier utilizing the multiple
inductive-series peaking technique is not obviously degraded when
more stages are cascaded, which indicates that the gain and
bandwidth trade-off can be ameliorated by the technique.

Fig. 1. (a) Proposed wideband amplifier structure. (b) Equivalent inter-stage small-signal model of the proposed amplifier.

Fig. 2. Comparison between first- and third-order filters with different
inductance value.
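
The bandwidth extension can be illustrated with a small numerical experiment. The sketch below compares a first-order RC inter-stage node with a third-order network in which a series inductor separates two equal RC nodes. The network topology (current driven into the first node, output taken at the second), the normalization to R = C = 1, and the choice of a normalized inductance of 0.5 are assumptions made for this illustration; the code does not reproduce the paper's equation (2) or its optimum $L_T$.

```python
import math

def mag_rc(w):
    """Normalized magnitude of the first-order node (R_T*C_T = 1)."""
    return 1.0 / abs(1 + 1j * w)

def mag_peaked(w, l_norm):
    """Normalized magnitude with a series inductor between two equal RC nodes.

    Each node has admittance y = 1 + s (R = C = 1); l_norm = L / (R^2 * C).
    The transfer impedance is 1 / (2*y + s*L*y^2), scaled to unity at dc.
    """
    s = 1j * w
    y = 1 + s
    return 2.0 / abs(2 * y + s * l_norm * y * y)

def bandwidth_3db(mag, w_max=50.0, step=1e-3):
    """First frequency where mag drops below 1/sqrt(2), refined by bisection."""
    target = 1 / math.sqrt(2)
    w = step
    while w <= w_max:
        if mag(w) < target:
            lo, hi = w - step, w
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if mag(mid) < target:
                    hi = mid
                else:
                    lo = mid
            return 0.5 * (lo + hi)
        w += step
    raise ValueError("no -3-dB crossing found below w_max")

bw_rc = bandwidth_3db(mag_rc)
bw_lc = bandwidth_3db(lambda w: mag_peaked(w, 0.5))
print(f"bandwidth ratio = {bw_lc / bw_rc:.2f}")
```

For this assumed network the inductor roughly doubles the 3-dB bandwidth; the exact factor depends on the inductance and on how much peaking is tolerated.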

The proposed TIA is shown in Fig. 4, where on-chip inductors
and m-derived half circuits have been employed. The photodiode
capacitance, which usually sets the dominant pole, and the
parasitic capacitances can be absorbed as a part of the impedance
transformation network by utilizing the multiple inductive-series
peaking technique. However, the filter structure exhibits
considerable frequency dependence. If terminated to resistive





Vout an 


loads directly, the mismatch will deteriorate the filter response
significantly. To circumvent this issue, m-derived half circuits,
which exhibit a more uniform impedance, have been utilized in the
input and output matching networks [7]. The circuit simulation
result is depicted in Fig. 5(a), which shows that the 3-dB
frequency of a conventional 5-stage inverter-configuration TIA is
2.4 GHz, and the 3-dB frequency of the proposed TIA is 7.4 GHz,
which is 3 times larger than the conventional one. Considering
trade-offs between noise and inter-symbol interference, the
bandwidth is commonly chosen as 0.7-0.8 times the data rate;
hence the simulated bandwidth is sufficient for 10-Gb/s optical
fiber link applications. Fig. 5(b) shows the simulated gains
with different inductor series resistances. It is shown that the
circuit performance is insensitive to inductor quality factor.
With a 50% reduction of inductor quality factor, the gain drops
by 2 dB and the bandwidth decreases by only 3%. Compared to the
inductive shunt-peaking technique, which is very sensitive to
stray capacitance induced by spiral inductors, the proposed TIA
exhibits larger bandwidth enhancement and less sensitivity to
on-chip inductor quality factor.


III. EXPERIMENTAL RESULTS 


The proposed TIA has been implemented in 0.18-µm CMOS
technology and measured by on-wafer testing. Fig. 6 shows the
die photo. To accurately demonstrate the capability of
accommodating PD capacitance and load capacitance, two 250-fF
MIM capacitors have been integrated on this chip. Because the
circuit is insensitive to inductor quality factor, miniature 3-D
inductors have been adopted to further minimize die area [8].
The core circuit area is only 0.14 mm², which is almost equal
to the area of a 5-nH planar inductor.

Fig. 7 shows the measured gain and group delays. The measured
gain is 61 dBΩ and the 3-dB frequency is 7.2 GHz. Within the
3-dB bandwidth, the average group delay is 275 ps with a ripple
of about 25 ps. Fig. 8 shows the measured average input
equivalent noise current density of 8.2 pA/√Hz.

The measured eye diagrams with a $2^{31} - 1$ PRBS are
depicted in Fig. 9. The measured output eye diagram is still
well open at a larger input current of 3.1 mA. Compared to a
resistive feedback TIA, the inverter-configuration TIA possesses
a superior capability to accommodate larger input current. The
proposed TIA is well suited to optical fiber link applications,
which require a wide dynamic range.

Fig. 5. Simulation results. (a) Gains of conventional and proposed TIAs. (b) Proposed TIA's gain versus inductor's series resistance.

Fig. 6. Die photo of the area-efficient TIA.
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TABLE I 
SUMMARY OF MEASURED PERFORMANCE AND BRIEF COMPARISON WITH STATE-OF-THE-ART PUBLICATIONS 


0.18um CMOS 0.25um BiCMOS 
sv 
1.12kQ 500 Q 13k Q 


0.25 pF 0.5 pF 0.15 pF 
Oe eee 














Reference 







Process 





Supply Voltage 
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-3dB Bandwidth 
















PD Capacitance 










Sensitivity -17dBm (Pin) 


Input Equivalent 
Ey oe 9.5pAN Hz 
Noise 


Power 





Dissipation 
Chip Area 








N/A 


Measured results and a brief comparison with state-of-the-art
10-Gb/s TIA publications are summarized in Table I. Low-voltage
and low-power operation can be achieved by eliminating
power-hungry intermediate and output buffers. This fully
integrated TIA demonstrates efficient use of chip area and
power, requiring only 0.14 mm² and 70.2 mW from a single
1.8-V supply.

Fig. 8. Measured input equivalent noise current density (averaged
input noise current and smoothed curve versus frequency in GHz).


IV. CONCLUSION


A bandwidth-extension technique called the multiple
inductive-series peaking technique has been introduced in this
paper. A 10-Gb/s CMOS TIA has been presented to demonstrate
the bandwidth-extension technique. Employing the multiple
inductive-series peaking technique, the CMOS TIA reported
here achieves a gain of 61 dBΩ with a bandwidth of 7.2 GHz.
The measured results demonstrate that the proposed technique
of bandwidth extension can improve bandwidth performance
significantly. The proposed technique of bandwidth extension is
suitable for CMOS devices to achieve wideband and low-power
characteristics simultaneously.


60-GHz SOI CMOS Traveling-Wave Amplifier
With NF Below 3.8 dB From 0.1 to 40 GHz


Frank Ellinger, Member, IEEE 


Abstract—In this paper, the design and the results of a CMOS 
traveling-wave amplifier (TWA) optimized for minimum noise
figure are presented. Design tradeoffs and optimization guidelines
for maximum operation frequency, gain, and minimum noise are
discussed by means of analytical calculations and simulations.
The MMIC is fabricated using digital 90-nm silicon on insulator
(SOI) technology and requires a chip area of only 0.3 mm². At a
supply voltage of 2 V and a supply current of 66 mA, a gain of
9.7 dB ± 1.6 dB is measured over a frequency range from 10 to
59 GHz. Toward dc, the gain increases up to 16 dB. The unity gain
cutoff frequency is 71 GHz. At 20 and 40 GHz, the circuit has a
1-dB output compression point of 12.5 and 9.5 dBm, respectively.
From 0.1 to 40 GHz, a noise figure below 3.8 dB is measured.
The results are achieved at source/load impedances of 50 Ω and
include the pad parasitics. To the author's knowledge, the TWA
has by far the lowest noise figure achieved for a silicon-based
amplifier with comparable bandwidth.


Index Terms—CMOS, low-noise amplifier, millimeter-wave fre- 
quency, MMIC, SOI, traveling-wave amplifier. 


I. INTRODUCTION 


OVER the last years, the speed gap between leading-edge
III/V and CMOS technologies has decreased significantly.
Recently, an SOI CMOS technology with a transit
frequency ($f_t$) of 243 GHz and a maximum frequency of
oscillation ($f_{max}$) of 208 GHz has been reported [1]. Compared
to conventional bulk technology, the implementation of a thin
isolation layer between the active transistor area and the
substrate allows a higher substrate resistivity without degrading
the threshold properties of the MOSFETs. Consequently, the
parasitics of the transistors and the passive devices are reduced,
thereby increasing their speed and Q factor, respectively.

Analog circuits such as a 26-42-GHz low-noise amplifier 
[3], a 30-40-GHz mixer [4], a 52-62-GHz oscillator [5] and 
a 26.5-28.5-GHz frequency doubler [6] have been designed, 
demonstrating the suitability of SOI CMOS technologies for 
analog applications at millimeter-wave frequencies. 

Wideband amplification is important for many systems such
as ultra-wideband (UWB) transceivers, measurement equipment,
and optical communication. The excellent bandwidth
performance of traveling-wave amplifiers (TWAs) is well known
[7]. In contrast to cascaded amplifier topologies, the gains of
the TWA stages are added instead of multiplied. Thus, TWAs
provide a relatively low gain. However, due to the incorporation
of the parasitic capacitances of the amplifier stages into
artificial transmission lines, very high bandwidths can be
achieved. Recently, an SOI TWA has been reported yielding a gain
of 5 dB up to a very high operation frequency of 91 GHz [8].

In this paper, a TWA is presented, which was optimized 
for minimum noise and maximum gain up to 40 GHz. The 
circuit was fabricated on very large scale integration (VLSI) 
SOI CMOS technology optimized for digital rather than for 
analog applications. With a noise figure below 3.8 dB from 
0.1 to 40 GHz, the presented TWA significantly improves the 
state-of-the-art noise performance of CMOS wideband ampli- 
fiers operating at millimeter-wave frequencies. The result is 
close to the one achieved with leading-edge III/V technologies. 
As an example, a TWA on metamorphic HEMT technology has 
been reported providing a noise figure below 3.7 dB from 5 to 
40 GHz [10]. A comparison with other state-of-the-art TWAs 
is shown in Table I. 


II. MODELING


The TWA was fabricated on experimental 90-nm IBM VLSI 
SOI CMOS technology featuring a metal stack with 8 metals 
and a substrate resistivity of 13.5-+45 Qcm. Detailed information 
about the technology can be found in [1]-[6]. 

In Fig. 1, the small-signal and noise model of the n-channel
FETs with a gate width $w_g$ of 64 µm is shown. It is applied in
the HP advanced design system (ADS). The measured and
simulated S-parameters and the 50-Ω noise figures are compared
in Fig. 2. The device is biased in class-A operation with
a drain-source voltage of $V_{ds}$ = 1 V, a gate-source voltage
of $V_{gs}$ = 0.5 V, and a corresponding drain-source current of
$I_{ds}$ = 17 mA. At this bias point, an $f_t$ of 147 GHz and an
$f_{max}$ of 150 GHz were extracted. The transistors have a
threshold voltage of approximately 0.27 V and a drain-source
breakdown voltage well above 1 V. At 26 GHz, the FETs have an
$NF_{min}$ of approximately 1.1 dB [2].

Inductive transmission lines with an inductance per length 
of approximately 0.7 nH/mm and a loss of around 1.8 dB/mm 
are used. To minimize the parasitic ground capacitances and to 
allow high resonance frequencies in the range of 100 GHz, no 
ground shields are used. For further information about the in- 
ductive lines, the reader is referred to [3]. 


III. CIRCUIT DESIGN


In Fig. 3, the circuit schematics of the designed TWA is 
shown. The input signal travels down the input line, feeding 
each amplifier. Undesired reflections are absorbed by the 
termination resistors Rag and Rayq. Given that the phases 
of the input and output lines are equal, the amplified signals 
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TABLE I 
STATE-OF-THE-ART TWAS 


Technology/fmax Min. Gain 


Operation BW* 





Piap NF Pac Chip area | Ref. 





III/V based technology 




















0.6m GaAs MESFET/18GHz _| 12GHz 7dB na. n.a. 4Vx19mA 136mm | [il] | 

0.1m metamorphic HEMT/n.a. 40GHz 14dB 4dBm@ 20GHz <3.7dB 5-40GHz 3.5Vx143mA 6.3mm” [10] 
na. na. 250mW 084mm" | [12] 

0.1j1m InP HEMT/300GHz 112GHz 4dB n.a. n.a. n.a. 2.2mm [13] 





Silicon based technology 



















































































0.5um CMOS/n.a. 11.8dBm@5GHz 5.5dB @2GHz 3VxX27mA_—«|-0.79mm” ~‘| [14] 
0.18u4m CMOS/n.a. 8dB n.a. n.a. n.a. 2.34mm* [15] 
BJT/70GHz 15GHz | 8.7dB | n.a. 8.3dB @8GHz n.a. 7.5mm [16] 
0.18,4m CMOS/n.a. 22GHz 6.5dB na. 6.1dB@18GHz =| 1.3Vx40mA 135mm | [17] 
SiGe HBT/100 GHz 81GHz n.a. n.a. | 5.5Vx35mA 2.21mm* 
0.12,1m SOI CMOS/200GHz n.a. na. 2.6Vx35mA__| 0.82mm 


0.3mm This 
work 


*Lower frequency depends on external decoupling capacitors (BW: bandwidth). 





Fig. 1. Small signal and noise model of MOSFET at $V_{gs}$ = 0.5 V, $V_{ds}$ = 1 V
and $I_{ds}$ = 17 mA, transconductance $g_m$ = 82 mS, drain-source resistance
$R_{ds}$ = 67 Ω, drain inductance $L_d$ = 35 pH, gate resistance $R_g$ = 3 Ω,
gate-source resistance $R_{gs}$ = 20 Ω, gate leakage resistance of 10 kΩ,
gate-source capacitance $C_{gs}$ = 60 fF, gate inductance $L_g$ = 30 pH, gate-drain
capacitance $C_{gd}$ = 20 fF, drain-source capacitance $C_{ds}$ = 15 fF, drain noise
current source $I_{nd}$ = 45 pA, gate noise voltage source $V_{ng}$ = 200 pV.



























Fig. 2. Comparison between measurements and simulations of MOSFET at
$V_{gs}$ = 0.5 V, $V_{ds}$ = 1 V, $I_{ds}$ = 17 mA, plotted versus frequency f in GHz.
(a) S-parameters 2-100 GHz. (b) Noise figure at 50 Ω ($NF_{50\Omega}$).


are constructively added at the output line. This is the case
when the values for the inductance $L$ and the capacitance $C$
of the input and output lines are equal. For simplification, it is
assumed that the feedback from the input to the output of the
amplifiers ($S_{12}$) and the parasitics of the inductors can be
neglected. Consequently, the capacitance of the distributed line
is determined by the input capacitance $C_{in}$ of the amplifier,
which typically is larger than the output capacitance. An
additional shunt capacitance can be added at the output of the
amplifier stages to obtain equal capacitances and phase
conditions.

Common-gate and common-drain stages are not well suited
for the TWA amplifier stages, since they have resistive rather
than capacitive input and output impedances, respectively,
thereby causing high line losses. Cascode amplifier stages
as illustrated in Fig. 4 were used for the designed TWA,
since compared to common-source stages, they provide a
significantly higher output impedance with a value above
$g_m R_{ds}^2$ = 450 Ω, which is approximately 6 times larger than
that of a common-source stage using the same transistor.
Because this value is high relative to the line impedance of
50 Ω, the output resistance can be neglected in theoretical
calculations, which simplifies the analysis. The resistive
output losses of the amplifier are reduced and the gain is
increased. This is demonstrated in Fig. 5, where the measured
power gains of the common-source and the cascode stages are
compared. At 40 GHz, the common-source stage provides a maximum
stable gain (MSG) of 10 dB, whereas the cascode stage yields
a higher MSG of 17.5 dB.

The characteristic impedance and the 3-dB cutoff frequency
of a distributed line section can be approximated by

$$Z_0 = \sqrt{\frac{L}{C_{in}}} \tag{1}$$

and

$$f_c = \frac{1}{\pi\sqrt{L\,C_{in}}} \tag{2}$$

with $L$ as the line inductance. The choice of $w_g$ is a tradeoff
between the desired $g_m$ and corresponding power gain per stage
on one side, and the maximum $f_c$ on the other side. We can
determine the maximum $C_{in}$ and the associated $w_g$ for a
desired $f_c$. The design goal of this work was to achieve an
operation frequency of at least 40 GHz. To ensure that the $f_c$
of the transmission line sections is well above this frequency,
we chose an $f_c$ of 70 GHz. With a $Z_0$ of 50 Ω, we obtain a
$w_g$ of 64 µm and an $L$ of 225 pH.
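
The line-section values can be reproduced by solving the characteristic-impedance and cutoff-frequency relations simultaneously. The sketch below assumes the ideal lossless forms $Z_0 = \sqrt{L/C_{in}}$ and $f_c = 1/(\pi\sqrt{L C_{in}})$; the function name is hypothetical.

```python
import math

def line_section(z0, f_c):
    """Return (L, C_in) of an ideal line section for a given Z0 and cutoff f_c.

    From Z0 = sqrt(L/C_in) and f_c = 1/(pi*sqrt(L*C_in)):
    sqrt(L*C_in) = 1/(pi*f_c), so L = Z0/(pi*f_c) and C_in = 1/(pi*f_c*Z0).
    """
    root_lc = 1.0 / (math.pi * f_c)
    return z0 * root_lc, root_lc / z0

L_SEC, C_IN = line_section(50.0, 70e9)
print(f"L = {L_SEC * 1e12:.1f} pH, C_in = {C_IN * 1e15:.1f} fF")
```

The inductance obtained this way is within about 1% of the 225 pH quoted in the text, and the corresponding input capacitance budget is roughly 91 fF per section.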

The power gain of TWAs is limited by the gate line, drain line,
and inductive line losses. As discussed before, the losses of the
inductive lines are relatively small. It has been shown that the

Fig. 4. Simplified circuit schematics of cascode amplifier stage,
$C_{RF,shunt2}$ = 5 pF, $R_{bias}$ = 6 kΩ, $V_{g2}$ = 1.5 V, $V_{ds}$ of each FET ≈ 1 V.

Fig. 5. Measured stable gain (MSG) and maximum available gain (MAG) of
common source (CS) and cascode (CC) amplifier stages.


losses are mainly determined by the gate line losses [19]. This 
is especially the case for TWAs using cascode amplifier stages 
with high output resistance. If the losses of the drain line and 
the inductors are neglected, the small-signal power gain can be 
approximated by 


$$G = G_0\,(1 - n A_g)^2 \tag{3}$$

with the low-frequency gain

$$G_0 = \left(\frac{n\, g_m Z_0}{2}\right)^2 \tag{4}$$

and the gate loss factor ° 
An = 5G Ru? Ci Ze: (5) 


For derivations and explanations of (3)-(5), the reader is re- 
ferred to [19]. The third term of (5) from [19] was neglected 
since its impact is small compared to the second term. Further- 
more, we have substituted the factor ag/,/2 from [19] by Ag. 
By means of (3)-(5) we can show that for a given frequency,
maximum gain is achieved for a number of stages of

$$n_{opt} = \frac{1}{2 A_g} \tag{6}$$

Circuit schematics of TWA with four stages, L = 170 pH, Rang = 75 2, Raba = 502, Cre shunt = 25 pF, Voi = 0.5 V, Vaa = 2 V. 
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Fig.6. Simulated gain using cascode (CC) and common source (CS) amplifier 
stages with different number of stages n. 


With the given device parameters and an operation frequency
of 40 GHz, where according to the design goal optimum
performance should be reached, we obtain $A_g = 0.0895$, $n_{og} = 5.6$,
$G_0 = 21$ dB, and $G = 15$ dB. These calculations are
appropriate for first considerations and optimizations. Due to the
additional losses generated by the drain line and the inductors, the
total line losses will be slightly higher than assumed. Thus, in
reality, the values of $n_{og}$ and $G$ are upper limits.
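
As a quick numerical check of this estimate, the following sketch evaluates the optimum stage count $n_{og} = 1/(2A_g)$ and the gain $G = G_0(1 - nA_g)^2$ for the values quoted above ($A_g = 0.0895$, $G_0 = 21$ dB). The function names are illustrative and not taken from the paper, and $G_0$ is simply treated as the quoted 21 dB at the optimum rather than recomputed from device parameters.

```python
import math


def optimum_stages(a_g):
    """Stage count that maximizes G0 * (1 - n*A_g)**2 when G0 grows as n**2."""
    return 1.0 / (2.0 * a_g)


def gain_db(g0_db, n, a_g):
    """Gain in dB for low-frequency gain g0_db, n stages, and gate loss factor a_g."""
    return g0_db + 20.0 * math.log10(1.0 - n * a_g)


A_G = 0.0895    # gate loss factor at 40 GHz, as quoted in the text
G0_DB = 21.0    # low-frequency gain in dB, as quoted in the text

n_og = optimum_stages(A_G)
g_db = gain_db(G0_DB, n_og, A_G)
print(f"n_og = {n_og:.1f}, G = {g_db:.1f} dB")
```

At the optimum, the loss term $(1 - nA_g)^2$ equals 1/4, so the gain sits about 6 dB below $G_0$, which reproduces the quoted 15 dB.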

Furthermore, ADS simulations using the more precise model 
presented in Fig. 1, and lumped equivalent circuits for the pas- 
sive devices [3], [4], were performed. 

In Fig. 6, the simulated gain versus frequency and number
of cascode stages is shown. For frequencies up to 40 GHz, the
simulations predict an $n_{og}$ of approximately 5, which is in good
agreement with the theoretical calculations. Due to a more
accurate consideration of the parasitics, the simulated power gain
of 12 dB at 40 GHz is 3 dB lower than the calculated one. The
results of a TWA with common-source stages are also included
for comparison, verifying the superior properties of the cascode
circuit.

The performance of the circuit is influenced by the inductors.
In Fig. 7, the simulated gain versus frequency is illustrated
for different inductor values. All relevant parasitics are
considered for the scalable inductor model. Two competing effects
arise. With increasing inductor value, the capacitive parasitics
of the FETs can be compensated, improving the gain. However,
an increasing inductor value has two drawbacks. First, the
series resistance of the inductor becomes large. Second,
above an associated resonance frequency, the gain drops
significantly. Both effects degrade the maximum gain cutoff
frequency. Therefore, an optimum inductor value has to be chosen.
According to Fig. 7, a value of L = 170 pH is well suited for
optimum gain performance up to 40 GHz. This value is slightly
lower than the one found by the idealized calculations.

Fig. 7. Simulated gain for different inductor values; parasitics are considered
and scaled.

Furthermore, the line termination resistors have a significant
impact, as depicted in Fig. 8. The gain toward low frequencies
decreases with falling resistor values. Thus, the gain flatness and
3-dB bandwidth can be improved. However, we will see later
that a decreased input termination resistor degrades the noise
performance. Consequently, a high $R_{abg}$ together with a low
$R_{abd}$ is advantageous concerning an optimum tradeoff between
3-dB bandwidth and noise.

With the device-dependent drain and gate noise coefficients
$\gamma$ and $\delta$, respectively, the noise figure of FET TWAs can be
approximated by [20]

$$F = 1 + \frac{4\gamma}{n \, g_m Z_0} + \frac{n \, \omega^2 C_{in}^2 Z_0 \, \delta}{3 g_m}. \tag{7}$$


IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 40, NO. 2, FEBRUARY 2005

Fig. 9. Simulated noise figure of different TWAs with cascode and common
source amplifier stages, n: number of stages.

Fig. 10. Simulated noise figure of four-stage cascode TWA with different gate
line terminations.

The second term describes the drain noise, which is dominant
at low frequencies, whereas the third term represents the
frequency-dependent gate noise determining the high-frequency
performance. Typically, values of $2/3 < \gamma < 1$ and $\delta = 4/3$ are
reported for long-channel devices [21]. Due to hot electron
effects, significantly higher drain noise currents and $\gamma$ coefficients
are expected for short-channel devices as used in this work. By
fitting of the measured noise figure, we obtain values of $\gamma = 2.2$
and $\delta = 1.5$, which is in very good agreement with the data
extracted for a single transistor [2]. From (7), a minimum noise
figure of

$$F_{min} = 1 + \frac{2 \omega C_{in}}{g_m} \sqrt{\frac{4 \gamma \delta}{3}} \tag{8}$$

can be derived for a number of stages of

$$n_{of} = \frac{2}{\omega C_{in} Z_0} \sqrt{\frac{3 \gamma}{\delta}}. \tag{9}$$

At an operation frequency of 40 GHz, we obtain $n_{of} = 3.7$ and
$F_{min}$ = [illegible] dB.
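
The relation between the noise figure approximation (7), its minimum (8), and the optimum stage count (9) can be checked numerically. The sketch below is only an illustration: the fitted coefficients $\gamma = 2.2$ and $\delta = 1.5$ and the 40-GHz operating frequency come from the text, whereas `Z0`, `C_IN`, and `G_M` are assumed values that do not appear in this section. `C_IN` was picked so that the optimum lands near the quoted 3.7 stages, and the resulting noise figure should not be read as the paper's result.

```python
import math


def noise_factor(n, omega, c_in, g_m, z0, gamma, delta):
    """Noise factor approximation (7): 1 + drain-noise term + gate-noise term."""
    drain = 4.0 * gamma / (n * g_m * z0)
    gate = n * omega**2 * c_in**2 * z0 * delta / (3.0 * g_m)
    return 1.0 + drain + gate


def optimum_stages_noise(omega, c_in, z0, gamma, delta):
    """Stage count (9) at which the drain and gate terms of (7) are equal."""
    return 2.0 / (omega * c_in * z0) * math.sqrt(3.0 * gamma / delta)


def min_noise_factor(omega, c_in, g_m, gamma, delta):
    """Closed-form minimum (8) of the noise factor over the stage count."""
    return 1.0 + 2.0 * omega * c_in / g_m * math.sqrt(4.0 * gamma * delta / 3.0)


GAMMA, DELTA = 2.2, 1.5            # fitted noise coefficients from the text
OMEGA = 2.0 * math.pi * 40e9       # 40 GHz operating frequency from the text
Z0 = 50.0                          # assumed line impedance in ohms
C_IN = 90e-15                      # assumed input capacitance in farads
G_M = 50e-3                        # assumed transconductance in siemens

n_of = optimum_stages_noise(OMEGA, C_IN, Z0, GAMMA, DELTA)
f_at_opt = noise_factor(n_of, OMEGA, C_IN, G_M, Z0, GAMMA, DELTA)
f_min = min_noise_factor(OMEGA, C_IN, G_M, GAMMA, DELTA)
print(f"n_of = {n_of:.2f}")
print(f"F(n_of) = {f_at_opt:.4f}, closed-form F_min = {f_min:.4f}")
```

Evaluating (7) at the stage count from (9) returns the closed-form value of (8), and moving one stage in either direction raises the noise factor. This is the expected behavior for a sum of one term proportional to $1/n$ and one proportional to $n$.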

For comparison, noise simulations were performed in ADS.
As depicted in Fig. 9, up to 40 GHz, good noise performance is
achieved for an $n_{of}$ of approximately 4, verifying the theoretical
results. Furthermore, the simulations show that the best
low-frequency noise performance is achieved at high $n$, whereas
for high frequencies, the lowest noise figures are reached for low
values of $n$. In accordance with (7), this is attributed to the fact
that the drain noise is inversely proportional to $n$, whereas the
gate noise is proportional to $n$.

At low frequencies, a TWA behaves as a single transistor with
all amplifier stages connected in parallel. Furthermore, the input
and output are directly terminated by the absorption resistors.
As clearly shown in Fig. 10, the gate line termination resistor
$R_{abg}$ significantly increases the noise toward low frequencies.
Thus, the low-frequency noise performance can be improved
by increasing the input termination resistor. Unfortunately, this
decreases the input return loss. A nominal value of $R_{abg}$ = 75 Ω
was chosen since this provides a reasonable tradeoff between
enhanced noise performance and acceptable input return loss.

Fig. 11. Photograph of compact TWA MMIC with chip size of 0.89 mm
× 0.33 mm.

Fig. 12. Measured, simulated, and calculated gain.

Up to 40 GHz, the calculated values for $n_{of}$ and $n_{og}$ are close
together. A value of n = 4 was used for the final realization of
the TWA. The nominal value of $R_{abd}$ is 50 Ω.

A photograph of the compact TWA MMIC with overall chip 
size of 0.89 mm x 0.33 mm is shown in Fig. 11. To the author’s 
knowledge, this is the smallest chip size of a TWA reported to 
date. In mass fabrication, the small chip size scales down the 
costs. 


IV. RESULTS 


All measurements were performed on wafer, at source and
load impedances of 50 Ω, and include the parasitics of the signal
pads. The circuit draws $I_{dd}$ = 66 mA from a supply of $V_{dd}$ = 2 V.
As for the device characterization, S-parameters were measured
using an HP 8510XF network analyzer. The noise figure setup
consists of an HP 8970B noise figure meter, an HP 8971C test
set extension, and an external mixer allowing measurements up to
40 GHz. An HP 436A power meter was used for determination
of the compression point.

The measured wafer was based on experimental hardware 
that showed process variations. The deviation of +60% for the 
termination resistors and the corresponding impact on the cir- 
cuit characteristics are significant and were considered in the 
following simulations. 

With a Rollett stability factor well above 1, the circuit is
unconditionally stable. In Fig. 12, the measured, simulated, and
calculated gains are shown. A gain of 9.7 dB ± 1.6 dB was measured
from 10 to 59 GHz. Toward dc, the gain increases up to 16 dB. The gain
cutoff frequency is 71 GHz.

The measured, simulated, and calculated noise figure of the
circuit is shown in Fig. 13. Toward dc and between 23 and
29 GHz, the noise figure is approximately 3.2 dB. Up to 40 GHz,
the noise figure is below 3.8 dB. To the author’s knowledge,
these are the best results achieved for a silicon-based wideband
amplifier operating up to millimeter-wave frequencies.
Unfortunately, with our current measurement equipment, it is not
possible to characterize the noise figure at higher frequencies.

In Fig. 14, the measured and simulated return losses are
shown. From dc to 60 GHz, the measured input and output
return losses are higher than 5 and 12 dB, respectively. Higher
return losses are expected for circuits from more nominal wafers.

At 0.1, 20, and 40 GHz, the measured 1-dB output
compression points are 13.3, 12.5, and 9.5 dBm, corresponding to power
added efficiencies of 16%, 13.5%, and 6.7%, respectively. The
TWA was optimized as a low-noise amplifier. However, due to the
good large-signal performance, the circuit can also be used as a
medium-power amplifier. The output power should be sufficient
for many short-range WLAN systems.
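
The quoted efficiencies can be cross-checked against the bias point given for the measurements (2 V and 66 mA, i.e., 132 mW of dc power). The sketch below takes the ratio of the output power at the 1-dB compression point to the dc power. It ignores the RF input power, which a strict power-added-efficiency calculation would subtract, so it is an approximation under that assumption rather than the author's stated procedure. The helper name `dbm_to_watts` is illustrative.

```python
V_DD = 2.0      # supply voltage in volts, from the measurement description
I_DD = 66e-3    # supply current in amperes, from the measurement description


def dbm_to_watts(p_dbm):
    """Convert a power level in dBm to watts."""
    return 1e-3 * 10.0 ** (p_dbm / 10.0)


p_dc = V_DD * I_DD  # dc power in watts

# Frequency in GHz mapped to the measured 1-dB output compression point in dBm.
compression_points = {0.1: 13.3, 20.0: 12.5, 40.0: 9.5}

efficiency_percent = {
    f_ghz: 100.0 * dbm_to_watts(p_dbm) / p_dc
    for f_ghz, p_dbm in compression_points.items()
}

for f_ghz, eff in efficiency_percent.items():
    print(f"{f_ghz:5.1f} GHz: {eff:.1f} %")
```

The ratios come out at roughly 16.2%, 13.5%, and 6.8%, in line with the quoted 16%, 13.5%, and 6.7%.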


V. CONCLUSION 


The design and results of a low-noise CMOS TWA have
been presented. Design tradeoffs and optimization guidelines
for maximum operation frequency, gain, output power, and
minimum noise have been discussed by means of analytical
calculations and simulations.

The circuit has been fabricated using 90-nm SOI technology
and requires a chip area of only 0.3 mm², which to the author’s
knowledge is the smallest size reported for a TWA. The
technology used is optimized for digital VLSI applications rather
than for analog applications. Despite the restrictions of this
technology for analog circuits, excellent results have been achieved.
From 0.1 to 59 GHz, the circuit has a gain above 8 dB. A very
low noise figure below 3.8 dB has been measured from 0.1
to 40 GHz. The author believes that this is the best noise
performance demonstrated for a silicon-based amplifier with
comparable bandwidth. The achieved result is close to the one reported
using leading-edge III/V technology. With a 1-dB output
compression point of 13.3 to 9.5 dBm from 0.1 to 40 GHz, the circuit
is also suited as a medium-power amplifier.

Together with other works, this paper clearly shows the excel- 
lent suitability of VLSI SOI CMOS technology for analog cir- 
cuits at millimeter-wave frequencies, which not long ago were 
the exclusive domain of III/V technologies. This may lead to 
new market perspectives in areas such as WLAN, measurement 
equipment, and radar systems, since in the future, high data rates 
could be achieved at low costs.

Fractional-N PLL With 1-Mb/s In-Loop Modulation”

Ian Galton, Member, IEEE


A technique was presented in [1] that is similar to that presented in 
the above paper [2]. It was published shortly before the above paper 
went to press, and therefore should have been included as a reference 
in the above paper.

The first author of [1] has indicated that the topologies shown in
Fig. 4 of the above paper [2] are the same as those described in [1], [3], 
and [4]. He has also stated that the means of detection of the direction 
of the wave described on page 2184 of [2] is the same as that in [4]. 
We regret the unintentional omission of these references.

Abstract—A sense amplifier (1300, 1500) is provided for sensing the state of
a toggling type magnetoresistive random access memory (MRAM) cell without
using a reference. The sense amplifier (1300, 1500) employs a sample-and-hold
circuit (1336, 1508) combined with a current-to-voltage converter (1301, 1501),
gain circuit (1303), and cross-coupled latch (1305, 1503) to sense the state of
a bit. The sense amplifier (1300, 1500) first senses and holds a first state of
the cell. The cell is toggled to a second state. Then, the sense amplifier (1300,
1500) compares the first state to the second state to determine the first state of a
toggling type memory cell.

Abstract—A delay locked loop circuit with a novel structure for improving
jitter performance is disclosed. The delay locked loop circuit includes a delay
circuit for receiving an input clock signal and generating a delayed output clock 
signal. The delay circuit has a predetermined minimum variable delay, and the 
output clock signal is delayed with respect to the input clock signal by a delay to 
be determined in accordance with a delay control signal inputted into the delay 
circuit. Moreover, the delay locked loop circuit includes a phase determining 
block for receiving the input clock signal and the output clock signal, generating 
a phase pull signal when a phase of an input clock signal being delayed by a first 
predetermined time period leads a phase of the output clock signal, and gener- 
ating a phase push signal when a phase of the input clock signal lags behind a 
phase of a delayed output clock signal delayed by a second predetermined time, 
and a delay control circuit for generating the delay control signal for control- 
ling the delay circuit to reduce the delay when the phase pull signal is received 
from the phase determining block and to increase the delay when the phase push 
signal is received from the phase determining block. The delay control circuit 
does not change the delay of the delay circuit when neither the phase pull signal 
nor the phase push signal is received from the phase determining block. 
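
The control behavior described in this abstract amounts to a phase detector with a dead zone. The sketch below is one possible reading of it, not the patented circuit: clock phases are represented as edge arrival times (an earlier time means a leading phase), and the names `phase_decision`, `update_delay`, `t_first`, `t_second`, `step`, and `min_delay` are hypothetical.

```python
def phase_decision(t_in, t_out, t_first, t_second):
    """Classify the phase relation between an input edge and an output edge.

    t_in and t_out are edge arrival times; t_first and t_second are the two
    predetermined time periods that define the dead zone.
    """
    if t_in + t_first < t_out:
        # Even the delayed input edge leads the output edge: output is late.
        return "pull"
    if t_in > t_out + t_second:
        # The input edge lags even the delayed output edge: output is early.
        return "push"
    return "hold"


def update_delay(delay, decision, step, min_delay):
    """Adjust the variable delay; leave it unchanged inside the dead zone."""
    if decision == "pull":
        return max(min_delay, delay - step)  # never go below the minimum delay
    if decision == "push":
        return delay + step
    return delay


# Example: the output edge arrives well after the input edge, so delay is reduced.
decision = phase_decision(t_in=0.0, t_out=1.0, t_first=0.2, t_second=0.2)
new_delay = update_delay(delay=5.0, decision=decision, step=1.0, min_delay=4.5)
print(decision, new_delay)
```

Because the delay is left untouched whenever neither signal is asserted, small phase errors inside the dead zone do not cause the delay to toggle back and forth, which is presumably how the structure improves jitter.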
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Abstract—A nested transimpedance amplifier (TIA) circuit includes a zero- 
order TIA having an input and an output. A first operational amplifier (opamp) 
has an input that communicates with the output of the zero-order TIA and an 
output. A first feedback resistor has one end that communicates with the input 
of the zero-order TIA and an opposite end that communicates with the output 
of the first opamp. A capacitor has one end that communicates with the input of 
the zero-order TIA. The gain-bandwidth product of the nested TIA is increased. 
Differential mode TIAs also have increased gain-bandwidth products.

Abstract—A technique is described to allow testing of high-speed digital
circuits using lower speed testing equipment, to allow circuits to be placed
into a sleep mode, and to allow burn-in testing of digital circuits with minimal overhead in
terms of silicon area or performance.

Abstract—In a feedback system such as a PLL, the integrating function
associated with a loop filter capacitor is instead implemented digitally and is easily
implemented on the same integrated circuit die as the PLL. There is no need for
either an external loop filter capacitor or a large loop filter capacitor to be
integrated on the same integrated circuit die as the PLL. In a preferred
embodiment, an analog phase detector is utilized whose phase error output signal is
delta-sigma modulated to encode the magnitude of the phase error using a dig- 
ital (i.e., discrete-time and discrete-value) signal. This digital phase error signal 
is "integrated" by a digital integration block including, for example, a digital ac- 
cumulator, whose output is then converted to an analog signal, optionally com- 
bined with a loop feed-forward signal, and then conveyed as a control voltage 
to the voltage-controlled oscillator. The equivalent "size" of the integrating ca- 
pacitor function provided by the digital integration block may be varied by in- 
creasing or decreasing the bit resolution of circuits within the digital block. Con- 
sequently, an increasingly larger equivalent capacitor may be implemented by 
adding additional digital stages, each of which requires a small incremental in- 
tegrated circuit area. The power dissipation of the digital integration block is 
reduced by incorporating a decimation stage to reduce the required operating 
frequency of the remainder of the digital integration block.

Abstract—A novel and useful apparatus for and method of automatic gain
control (AGC) using Kalman filtering and hysteresis. A nonlinear, time-variant 
loop filter such as a Kalman filter is employed in the feedback loop of an AGC 
circuit. The circuit is able to transition quickly and make fast adaptations to 
new levels of the input signal by use of a restart mechanism used to dynami- 
cally modify the gain of the loop filter thus enabling the AGC circuit to quickly 
adapt to changes in the signal level of the input. An AGC circuit incorporating 
a hysteresis circuit in the feedback loop is also disclosed.

Abstract—A circuit (100) is adapted for use in a radio frequency receiver
and includes a transconductance amplifier (110), a direct digital frequency syn- 
thesizer (130), and a digital-to-analog converter (DAC) (120). The transcon- 
ductance amplifier (110) has an input terminal for receiving a radio frequency 
signal, and an output terminal for providing a current signal. The direct digital 
frequency synthesizer (130) has an output terminal for providing a digital local 
oscillator signal at a selected frequency. The DAC (120) has a first input ter- 
minal coupled to the output terminal of the transconductance amplifier (110), 
a second input terminal coupled to the output terminal of the direct digital
frequency synthesizer (130), and an output terminal for providing an output signal.

Abstract—Analog-to-digital converter (ADC) structures and methods are
provided that reduce an initial converter nonlinearity by introducing an inverse 
nonlinearity into the converter’s response that is substantially the inverse of the 
initial converter nonlinearity. In a pipelined ADC embodiment, for example, 
upstream converter stages are selected that generate an upstream digital code 
which defines sufficient upstream code words to designate respective segments 
of the inverse nonlinearity. In response to each of the upstream code words, 
the conversion gain of the remaining downstream converter stages is then
sufficiently adjusted to insert the inverse nonlinearity into the converter response.

Abstract—Provided is a switched capacitor feedback circuit including two or
more input ports configured to receive a corresponding number of input signals
and at least one output port. The output port is configured to output an adjusting
signal. The input signals include a number of primary signals and two or more
reference signals that are associated with a first timing phase of operation. The
adjusting signal is produced based upon a comparison between the primary
signals and the reference signals. Also provided is a pair of active devices having gates
coupled together and structured to receive the adjusting signal. The active
devices are configured to provide a gain to the adjusting signal in accordance with
a predetermined gain factor, and facilitate an adjustment to the number of
primary signals based upon the gain during a second timing phase of operation.

Abstract—A variable gain amplifier is configured of an amplification circuit
and a control circuit controlling a gain of the amplification circuit. The ampli- 
fication circuit has first and second MOS transistors identical in characteristics 
and having respective sources connected to a first fixed potential. The amplifica- 
tion circuit has a differential gain proportional to a square root of a ratio between 
a current flowing through the first MOS transistor and a current flowing through 
the second MOS transistor. The control circuit applies a potential corresponding 
to a constant voltage plus a control voltage to a gate of the first MOS transistor 
and a potential corresponding to the constant voltage minus the control voltage 
to a gate of the second MOS transistor. 




6,791,431 September 14, 2004 


Compact Balun With Rejection Filter for 802.11a and 
802.11b Simultaneous Operation 


Inventor: De Flaviis; Franco (Irvine, CA) 
Assignee: Broadcom Corporation (Irvine, CA) 
Filed: September 3, 2002. 


Current U.S. Class: 333/26; 333/116
Intern'l Class: H01P 005/10
Field of Search: 333/25, 26, 116, 109, 112, 115


References Cited 


U.S. Patent Documents 


4375699  Mar., 1983  Hallford      455/327.
5455545  Oct., 1995  Garcia        333/26.
6018277  Jan., 2000  Vaisanen      333/26.
6300919  Oct., 2001  Mehen et al.  343/895.
6515556  Feb., 2003  Kato et al.   333/116.


Foreign Patent Documents


2144985  Mar., 1973  EP


Other References 


Piernes B. et al., “Improvement of the Design of 180 DEG
Rat-Race Hybrid”, Electronics Letters, GB, vol. 36,
No. 12, pp. 1035-1036 (Jun. 2000).

Settaluri Raghu K. et al., “Compact Folded Line
Rat-Race Hybrid Couplers”, IEEE Microwave Guided
Wave Lett; IEEE Translations on Microwave and Guided
Wave Letters, Feb. 2000, IEEE, vol. 10, No. 2,
pp. 61-63.

Settaluri R.K. et al., “Design of Compact Multi-Level
Folded-Line RF Couplers”, 1999 IEEE MTT-S
International Microwave Symposium Digest,
IEEE Transactions on Microwave Theory and Techniques,
vol. 47, No. 12, pp. 2331-2339 (Dec. 1999).

Matsuura, H. et al., “Monolithic Rat-Race Mixers for
Millimeter Waves”, IEEE, pp. 101-104 (Jul. 2004).

Johnson, K.M., “X-Band Integrated Circuit Mixer with
Reactively Terminated Image”, Transactions on Microwave
Theory and Techniques, IEEE, vol. 16, No. 7,
pp. 388-397 (Jul. 1968).


Copy of European Search Report issued Jan. 13, 2004 for 
Appl. No. EP/03019915, 5 pages. 


Copy of European Search Report issued Jan. 13, 2004 for 
Appl. No. EP/03019914.4, 4 pages. 


Abstract—A balancing/unbalancing (balun) structure for operating at
frequency f1 includes a microstrip printed circuit board (PCB). A balun on the
PCB includes two input ports coupled to a differential signal. An isolated
port is connected to ground through a matched resistance. An output port is
coupled to a single-ended signal corresponding to the differential signal. A plurality
of traces on the PCB connect the two input ports, the load connection port and a
tap point to the output port. An f2 rejection filter on the PCB is wrapped around
the balun and includes a first folded element with a transmission length of λ2/4
and connected to the output port. A second folded element has a transmission
length of λ2/4 and connected to the tap point. A third folded element connects
the tap point to the output port and has a transmission length of λ2/4.

Abstract—An integrated circuit including a Multi-Threshold CMOS
(MTCMOS) latch combining low voltage threshold CMOS circuits with high
voltage threshold CMOS circuits. The low voltage threshold circuits including
a majority of the circuits in the signal path of the latch to ensure high
performance of the latch. The latch further including high voltage threshold circuits to
eliminate leakage paths from the low voltage threshold circuits when the latch
is in a sleep mode. A single-phase latch and a two-phase latch are provided.
Each of the latches is implemented with master and slave registers. Data is
held in either the master register or the slave register depending on the phase
or phases of the clock signals. A multiplexer may alternatively be implemented
prior to the master latch for controlling an input signal path during sleep and
active modes of the latch and for providing a second input signal path for test.

Abstract—Circuits and methods for a delta-sigma analog-to-digital converter
having a variable oversample ratio to produce a constant fullscale output at re- 
duced circuit complexity, die area, and power dissipation are provided. The cir- 
cuits and methods consist of scaling the digital input to the digital filter with a 
decoder whose size depends on the number of oversample ratios allowed by the 
analog-to-digital converter. The digital filter is implemented as a comb filter 
having a cascade of N integrators and N differentiators, where N is the order 
of the digital filter. The size of the differentiators is equal to the number of bits 
used as output for the analog-to-digital converter, which is smaller than the size 
of the integrators and the number of bits produced by the digital filter. 
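
A comb filter built from N integrators followed by N differentiators is commonly known as a cascaded integrator-comb (CIC) decimator. The sketch below is a generic textbook form of such a filter, written under the assumption that the differentiators run at the decimated rate. It is not the circuit claimed in the patent, it omits the input scaling decoder, and the name `cic_decimate` and its parameters are illustrative.

```python
def cic_decimate(samples, order, osr):
    """Decimate integer samples with a CIC filter of the given order.

    order is the number of integrator/differentiator pairs (N) and osr is the
    oversample ratio, i.e. the decimation factor.
    """
    # N cascaded integrators running at the input (oversampled) rate.
    accumulators = [0] * order
    integrated = []
    for x in samples:
        value = x
        for i in range(order):
            accumulators[i] += value
            value = accumulators[i]
        integrated.append(value)

    # Keep every osr-th sample.
    stage = integrated[osr - 1::osr]

    # N cascaded differentiators running at the output (decimated) rate.
    for _ in range(order):
        previous = 0
        differentiated = []
        for value in stage:
            differentiated.append(value - previous)
            previous = value
        stage = differentiated
    return stage


# A constant input of 1 settles to the dc gain of the filter, osr**order.
settled_osr4 = cic_decimate([1] * 64, order=2, osr=4)[-1]
settled_osr8 = cic_decimate([1] * 64, order=2, osr=8)[-1]
print(settled_osr4, settled_osr8)
```

The dc gain of this structure is `osr**order`, so the full-scale output changes whenever the oversample ratio changes. That dependence is presumably what the input scaling described in the abstract is meant to compensate.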




6,801,099 October 5, 2004 


Methods for Bi-Directional Signaling 


Inventor: Stark; Donald C. (Los Altos Hills, CA) 
Assignee: Rambus Inc. (Los Altos, CA) 
Filed: July 16, 2003. 


Current U.S. Class: 333/130; 324/329; 324/759; 324/765; 370/282
Intern'l Class: G01R 031/26; H03H 007/38
Field of Search: 333/130, 324/765, 329, 759 370/282


References Cited


U.S. Patent Documents


5719856  Feb., 1998  May            370/282.
6452428  Sep., 2002  Mooney et al.  327/108.


Abstract—Improved methods and apparatuses are provided for conducting 
bi-directional signaling and testing. The outputs of at least two driver circuits 
are connected to a resistive network. The output signals from the driver circuits 
are combined through the resistive network to produce a resultant signal that is 
an attenuated version of at least one of the output signals. The resistive network 
and the driver circuits are configured such that the resultant signal is provided 
to an output node of the resistive network but not to an input node of the re- 
sistive network. An input/output node of an external circuit is connected to the 
input node of the resistive network, wherein the external circuit is configured 
to receive the resultant signal and output an external signal. An input node of 
a receiver circuit is connected to the output node of the resistive network. The 
resultant signal is then simultaneously provided to the external circuit and the 
external signal to the receiver circuit, bi-directionally through the resistive
network.

Abstract—A method and system is arranged to convert a differential
low-voltage input signal (e.g. LVDS or RSDS) into a single-ended output signal. An
operational transconductance amplifier (OTA) is configured to convert the input
signal into a current. A transimpedance stage is configured to convert the
current into the single-ended output signal. The voltage associated with the output
of the OTA corresponds to approximately VDD/2. The transimpedance stage
comprises an inverter circuit, a p-type transistor, and an n-type transistor. The
transistors are arranged in a negative feedback configuration with the inverter.
The single-ended output signal has a voltage swing that approximately
corresponds to the sum of the $V_{GS}$ of the n-type transistor and the $V_{GS}$ of
the p-type transistor. The output signal may be buffered by additional circuits
such as an inverter, a Schmitt, as well as others.







6,806,744 October 19, 2004 


High Speed Low Voltage Differential to 
Rail-to-Rail Single Ended Converter 


Inventors: Bell; Marshall J. (Chandler, AZ), Cooper; David B. (Chandler, AZ),
and Kozisek; James (Fort Collins, CO).
Assignee: National Semiconductor Corporation (Santa Clara, CA)
Filed: October 3, 2003.


Current U.S. Class: 327/70; 327/53; 327/65


References Cited


U.S. Patent Documents


4539489  Sep., 1985  Vaughn       327/206.
6429735  Aug., 2002  Kuo et al.   327/563.
6433602  Aug., 2002  Lall et al.  327/205.
6512400  Jan., 2003  Forbes       327/66.


Varactor Folding Technique for Phase Noise
Reduction in Electronic Oscillators


Inventors: Gomez; Ramon Alejandro (San Juan, CA), Burns;
Lawrence M. (Luguna Mills, CA), and Kral; Alexandre
(Laguna Niguel, CA).
Assignee: Broadcom Corporation (Irvine, CA)
Filed: March 25, 2003.


Current U.S. Class: 331/179; 331/117FE; 331/175; 331/177V
Intern'l Class: H03B 005/08; H03B 005/12
Field of Search: 331/36 C, 117 R, 117 FE, 117 D, 175, 177 R, 177 V, 179






References Cited 


Leeson, “A Simple Model of Feedback Oscillator Noise
Spectrum,” Proceedings of the IEEE, Frequency
Stability, manuscript received Dec. 10, 1965,
revised Dec. 29, 1965, Feb. 1966, pp. 329-330.

Razavi, “Oscillators,” RF Microelectronics, Chapter 7,
pp. 206-246, Prentice Hall PTR 1998.

Rohde, “Microwave and Wireless Synthesizers: Theory
and Design,” pp. 567-572, John Wiley & Sons, Inc. 1997.

Copy of International Search Report for International
Application no. PCT/US00/34095, filed Dec. 14, 2000.

Hajimiri et al., “A General Theory of Phase Noise in
Electrical Oscillators,” IEEE Journal of Solid-State
Circuits, vol. 33, no. 2, Feb. 1998, pp. 179-194.

Kral et al., “RF-CMOS Oscillators with Switched
Tuning,” IEEE 1998 Custom Integrated Circuits
Conference, May 11-14, 1998, pp. 26.1.1-26.1.4,
pp. 555-558.

Abstract—A varactor folding technique reduces noise in controllable elec- 
tronic oscillators through the use of a series of varactors having relatively small 
capacitance. A folding circuit provides control signals to the varactors in a se- 
quential manner to provide a relatively smooth change in the total capacitance of 
the oscillator. Consequently, effective control of the oscillator is achieved with 
accompanying reductions in oscillator noise such as flicker noise.

Abstract—A nonvolatile reprogrammable switch for use in a PLD or FPGA
has a nonvolatile memory cell connected to the gate of an MOS transistor, which 
is in a well, with the terminals of the MOS transistor connected to the source of 
the signal and to the circuit. The nonvolatile memory cell is of a split gate type 
having a first region and a second region, with a channel therebetween. The cell 
has a floating gate positioned over a first portion of the channel, which is ad- 
jacent to the first region and a control gate positioned over a second portion of 
the channel, which is adjacent to the second region. The second region is con- 
nected to the gate of the MOS transistor. The cell is programmed by injecting 
electrons from the channel onto the floating gate by hot electron injection mech- 
anism. The cell is erased by Fowler-Nordheim tunneling of the electrons from 
the floating gate to the control gate. As a result, no high voltage is ever applied 
to the second region during program or erase. In addition, a MOS FET transistor 
has a terminal connected to the well, and another end to a voltage source, with 
the gate connected to the nonvolatile memory cell. The switch also has a cir- 
cuit element connecting the gate of the MOS transistor to a voltage source. The 
threshold voltage of the well can be dynamically changed by turning on/off the 
MOS FET transistor.

Abstract—A VCO (110) can be configured to convert an analog input signal
(105) to a digital output signal (125). In accordance with the inventive
arrangements, the VCO can convert the analog input signal to at least one intermediate
signal (130) having a frequency dependent on the analog input signal. A
frequency detector (115) can be configured to determine a frequency of at least
one intermediate signal. Subsequently, a mapping circuit (120) can be
configured to map the determined frequency of the at least one intermediate signal to
an output value representing the digital output signal (125).